Is it worth storing Time Machine backups on a faster drive?

Folk wisdom is that there’s little point in wasting a faster drive to store your Time Machine backups, because when those backups are made, their write speed is throttled by macOS. This article considers how true that might be, and what benefits there might be to using faster solid-state storage.

I/O Throttling

Documentation of I/O policy is in the man page for the getiopolicy_np() call in the Standard C Library, last revised in 2019, and prevailing settings are in sysctl’s debug.lowpri_throttle_* values. These draw a distinction between I/O to local disks, being “I/O sent to the media without going through a network” such as I/O “to internal and external hard drives, optical media in internal and external drives, flash drives, floppy disks, ram disks, and mounted disk images which reside on these media”, and those to “remote volumes” “that require network activity to complete the operation”. The latter are “currently only supported for remote volumes mounted by SMB or AFP.”

Inclusion of remote volumes is a relatively recent change, as in the previous version of this man page from 2006, they were explicitly excluded as “remote volumes mounted through networks (AFP, SMB, NFS, etc) or disk images residing on remote volumes.”

Five policy levels are supported:

  • IOPOL_IMPORTANT, the default, where I/O is critical to system responsiveness.
  • IOPOL_STANDARD, which may be delayed slightly to allow IOPOL_IMPORTANT to complete quickly, and presumably referred to in sysctl as Tier 1.
  • IOPOL_UTILITY, for brief background threads that may be throttled to prevent impact on higher policy levels, in Tier 2.
  • IOPOL_THROTTLE, for “long-running I/O intensive work, such as backups, search indexing, or file synchronization”, that will be throttled to prevent impact on higher policy levels, in Tier 3.
  • IOPOL_PASSIVE, for mounting files from disk images, and the like, intended more for server situations so that lower policy levels aren’t slowed by them.

However, the idea that throttled I/O is intentionally slowed at all times isn’t supported by the explanation of how throttling works: “If a throttleable request occurs within a small time window of a request of higher priority, the thread that issued the throttleable I/O is forced to a sleep for a short period. This slows down the thread that issues the throttleable I/O so that higher-priority I/Os can complete with low-latency and receive a greater share of the disk bandwidth.”

Settings in sysctl for Tier 3 give the window duration as 500 milliseconds, and the sleep period as 200 ms, except for SSDs, whose sleep period is considerably shorter at just 25 ms. Those also set a maximum size for I/O at 131,072 bytes. You can view those settings in the debug section of Mints’ sysctl viewer.
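
Those numbers suggest a rough way to estimate the cost of throttling under sustained contention. The sketch below assumes, and these are my assumptions rather than documented behaviour, that every 131,072-byte I/O is followed by the full sleep period, and uses representative base write speeds of 150 MB/s for a hard disk and 3 GB/s for an SSD:

```python
def throttled_rate(base_mb_s: float, sleep_s: float,
                   io_bytes: int = 131_072) -> float:
    """Effective MB/s if every throttled I/O of io_bytes is followed
    by the forced sleep (worst case, sustained contention)."""
    io_mb = io_bytes / 1_000_000
    io_time = io_mb / base_mb_s         # time to transfer one I/O
    return io_mb / (io_time + sleep_s)  # data per cycle / cycle time

hdd = throttled_rate(150, 0.200)   # hard disk, 200 ms sleep: about 0.65 MB/s
ssd = throttled_rate(3000, 0.025)  # SSD, 25 ms sleep: about 5.2 MB/s
```

Even as a crude worst case, this shows why the far shorter sleep period leaves SSDs much less affected by throttling than hard disks.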

Some years ago, it was discovered that the user can globally disable IOPOL_THROTTLE and presumably all other throttling policy with the command
sudo sysctl debug.lowpri_throttle_enabled=0
although that doesn’t persist across restarts, and isn’t documented in the man page for sysctl. This is provided in an option in St. Clair Software’s App Tamer, to “Accelerate Time Machine backups”, for those who’d rather avoid the command line.

Performance checks

Before each Time Machine backup starts, backupd runs two checks of disk performance, by writing one 50 MB file and 500 4 KB files to the backup volume, reported in the log as
Checking destination IO performance at "/Volumes/ThunderBay2"
Wrote 1 50 MB file at 286.74 MB/s to "/Volumes/ThunderBay2" in 0.174 seconds
Concurrently wrote 500 4 KB files at 23.17 MB/s to "/Volumes/ThunderBay2" in 0.088 seconds
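
Those logged rates can be checked from the sizes and times reported, taking MB as 10^6 bytes; the small differences from the logged figures presumably reflect rounding in the log entries:

```python
single_rate = 50 / 0.174               # one 50 MB file in 0.174 s: about 287 MB/s
multi_rate = (500 * 4 / 1000) / 0.088  # 500 x 4 KB = 2 MB in 0.088 s: about 22.7 MB/s
```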

With normal throttling in force, there’s surprisingly wide variation that appears only weakly related to underlying SSD performance, as shown in the table below.

[Table: Time Machine performance-check results for a range of SSDs]

Results from the three first full backups are far higher than those for subsequent automatic backups, with single-file tests recording 900-1100 MB/s, and multi-file tests 75-80 MB/s. Routine backups after those ranged from 190-365 MB/s and 3-32 MB/s, less than half those rates. Most extraordinary are the results for the OWC Express 1M2 enclosure when used with an M3 Pro: on the first full backup, the 50 MB test achieved 910 MB/s, but an hour later, during the first automatic backup, that fell to 191 MB/s.

[Image: CPU History window during a first full backup]

Evidence suggests that in macOS Sequoia, at least, first full backups may now run differently, and make more use of P cores, as shown in the CPU History window above. In comparison, automatic backups are confined to the E cores. However, the backup itself still runs at a tenth of the measured write speed of the SSD.

Time Machine runs the first full backup as a ‘manual backup’ rather than scheduling it through the DAS-CTS dispatching mechanism. It’s therefore plausible that initial backups aren’t run at Background QoS, giving them access to P cores, and that they’re run at a higher I/O throttling policy than IOPOL_THROTTLE, giving them higher-priority access to I/O.

Changing policy

When tested here previously, disabling throttling policy appeared to have little effect on Time Machine’s initial performance checks, but transfer rates during backups were as much as 150% of those achieved when throttling policy was in force.

The big disadvantage of completely disabling throttling policy is that it can only be applied globally, including I/O for Spotlight indexing, file synchronisation, other background tasks, and those needing high-priority access to I/O. Leaving the policy disabled in normal circumstances could readily lead to adverse side-effects, allowing Spotlight indexing threads to swamp important user storage access, for example.

SSD or hard disk?

Despite wide variation in test results, all those for the 50 MB file were far in excess of the highest write speed measured for any equivalent hard disk, 148 MB/s (on the faster outer tracks). Overall backup speeds for the first full backup to a USB4 SSD were also double that hard disk write speed. When throttling does occur, the sleep enforced on writes to a hard disk is eight times longer than that for an SSD, making the impact of I/O throttling greater for a hard disk than an SSD.

Another good reason for preferring SSD rather than hard disk storage is the use of APFS, and its tendency over time to result in severe fragmentation in the file system metadata on hard disks. Provided that an SSD has Trim support it should continue to perform well for many years.

Conclusions

  • macOS applies throttling policies on I/O with both local and networked storage to ensure that I/O-intensive tasks like backing up don’t block higher priority access.
  • Disabling those policies isn’t wise because of its general effects.
  • Those policies use different settings for I/O to SSDs from those for other types of storage to allow for their different performance.
  • First full backups to SSD appear to run at higher priority to allow them to complete more quickly.
  • As a result, effective write speeds to SSDs during Time Machine backups are significantly faster than could be achieved to hard disks.
  • When the consequences of using APFS on hard disks are taken into account, storing Time Machine backups on SSD has considerable advantages in terms of both speed and longevity.

Tune for Performance: Core types

When running apps on Intel Macs, because all their CPU cores are identical, the best you can hope for is that tasks make use of the number of cores available by running in multiple threads. Even single-threaded processes running on Apple silicon Macs get a choice of CPU core type, and this article explains the limited control you have over that.

[Diagram: model of CPU core allocation by QoS in macOS]

In my current model of CPU core allocation by macOS, shown in part above, Quality of Service (QoS) is the factor determining which core type a thread will be allocated to. QoS is normally hard-coded into an app, and seldom exposed to the user’s control.

If a thread is assigned a minimal QoS of Background (value 9), or less, then macOS will allocate it to be run on E cores, and it won’t be run on a P core. With a higher QoS, macOS allocates it to a P core if one is available; if not, then it will be run on E cores, with that cluster running at high frequency. Thus, threads with a higher QoS can be run on either P or E cores, depending on their availability.
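
That rule can be reduced to a toy decision function. This is a simplification for illustration only; the real scheduler weighs cluster load, frequency and much more:

```python
QOS_BACKGROUND = 9  # threads at this QoS or lower stay on E cores

def core_type(qos: int, p_core_available: bool) -> str:
    """Toy model of the allocation rule described above."""
    if qos <= QOS_BACKGROUND:
        return "E"                           # never promoted to a P core
    return "P" if p_core_available else "E"  # prefer P, fall back to E
```

So core_type(25, False) returns "E", matching the fallback behaviour described, while a Background thread returns "E" regardless of P core availability.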

taskpolicy

There are times when you might wish to accelerate completion of threads normally run exclusively on E cores. For example, knowing that a particular backup might be large, you might elect to leave the Mac to get on with that as quickly as possible. There are two methods that appear intended to change the QoS of processes: the command tool taskpolicy, and its equivalent code function setpriority().

Experience with using those demonstrates that, while they can be used to demote threads to E cores, they can’t promote processes or threads already confined to E cores so that they can use both types. For instance, the command
taskpolicy -B -p 567
which should promote the process with PID 567 to run on both core types, has no effect on processes or their threads that are run at low QoS. taskpolicy can be used to demote processes and their threads with higher QoS to use only E cores, though. Running the command
taskpolicy -b -p 567
does confine all threads to the E cluster, and can be reversed using the -B option for threads with higher QoS (but not those set to low QoS by the process).

[Image: CPU History window showing threads before and after taskpolicy -b]

That can be seen in this CPU History window from Activity Monitor. An app ran four threads, two at low QoS and two at high QoS. At the left of each core trace, the threads run on their respective cores, as set by their QoS. The app’s process was then changed using taskpolicy -b and the threads run again, as seen at the right, where the two threads with high QoS run together with the two at low QoS on the four E cores alone.

The best way to take advantage of this ability to demote high QoS threads to run them on E cores is in St. Clair Software’s excellent utility App Tamer.

Virtualisation

macOS virtual machines running on Apple silicon chips are automatically assigned a high QoS, and run preferentially on P cores. Thus, even when running threads at low QoS, those are run within threads on the host’s P cores. This remains the only known method of electively running low QoS threads on P cores.

Game Mode, CPU cores

In the Info.plist of an application, the developer should assign that app to one of the standard LSApplicationCategoryTypes. If that’s one of the named game types, macOS automatically gives that app access to Game Mode. Among its benefits are “highest priority access” to the CPU cores, in particular the E cores, whose use by background threads is reduced. Other benefits include highest priority access to the GPU, and doubled Bluetooth sampling rate to reduce latency for input controllers and audio output. Game Mode is automatically turned on when the app is put into full-screen mode, and turned off when full-screen mode is stopped, although the user also has manual control through the mode’s item in the menu bar.

In practice, while this is beneficial to many games, it has little if any use for modifying core type allocation in other circumstances. As the user can’t modify an app’s Info.plist without breaking its signature and notarization, this is only of use to developers.

Summary

  • The QoS assigned by an app to its threads is used by macOS to determine which core type to allocate them to.
  • Threads with a low (Background, 9) QoS are run exclusively on E cores. Those with higher QoS are run preferentially on P cores, and normally only on E cores when no P core is available.
  • App Tamer and taskpolicy can be used to demote higher QoS threads to be run on E cores, but low QoS threads can’t be promoted to be run on P cores.
  • macOS VMs run on P cores, so low QoS threads running in a VM will be run on P cores, and can be used to accelerate Background threads.
  • Game Mode gives games priority access to CPU cores, particularly to E cores, but users can’t elect to run other apps in Game Mode, and there don’t appear to be benefits in doing so.
  • If you feel an app would benefit from user control over CPU core type allocation through access to their QoS, suggest it to the app’s developer.

Explainer

Quality of Service (QoS) is a property of each thread in a process, and normally chosen from the macOS standard list:

  • QoS 9 (binary 001001), named background and intended for threads performing maintenance, which don’t need to be run with any higher priority.
  • QoS 17 (binary 010001), utility, for tasks the user doesn’t track actively.
  • QoS 25 (binary 011001), userInitiated, for tasks that the user needs to complete to be able to use the app.
  • QoS 33 (binary 100001), userInteractive, for user-interactive tasks, such as handling events and the app’s interface.

There’s also a ‘default’ value of QoS between 17 and 25, an unspecified value, and in some circumstances you might come across others.
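
As a footnote, the four standard values above follow a simple pattern, each being 8 × tier + 1. That's just an observation about these listed values, not documented API:

```python
qos_values = {
    "background": 9,        # 0b001001
    "utility": 17,          # 0b010001
    "userInitiated": 25,    # 0b011001
    "userInteractive": 33,  # 0b100001
}
# Shifting out the low three bits recovers the tier number, 1-4:
tiers = {name: v >> 3 for name, v in qos_values.items()}
```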

Are sparse bundles faster than disk images?

Following tests over two years ago, I recommended the use of sparse bundles (UDSB) rather than plain UDIF read-write (UDRW) disk images, because of their faster and more consistent performance. Since then, read-write disk images have adopted a sparse file format, making them more space-efficient. However, their write performance has remained substantially lower than sparse bundles, and the latter remain the preferred format.

More recent performance tests have confirmed this in Sequoia, as shown in the table below.

[Table: disk image transfer performance in Sequoia]

Most recently, tests of file compression showed relatively poor performance within a sparse bundle, with overall compression speed only 57% that of the SSD hosting the bundle despite apparently good write speeds. This article explores why, and whether that should alter choice of disk images.

Consistency in write speed

As previous articles note, inconsistent performance troubles disk images. Because of their complex bundle structure and storage across thousands of band files, it’s important to consider whether this poor performance is merely a reflection of that inconsistency.

[Chart: Stibium write speed measurements for a sparse bundle]

The original Stibium write speed measurements, separated out from the collation shown in that recent article, show more dispersion about the regression line than other tests. As those are based on a limited number of file sizes, a total of 130 write speed measurements were performed using file sizes ranging from 4.4 MB to 3.0 GB, 50 of them being randomly chosen.

[Chart: sparse bundle write speeds across 130 file sizes]

Those confirm that some individual write speeds are unexpectedly low, and dispersion isn’t as low as hoped for, but don’t make inconsistency a plausible explanation.

Band size

Unlike other types of disk image, data saved in sparse bundles is contained in many smaller files termed bands, and the maximum size of those band files can be set when a sparse bundle is created. By default, band size is set at 8.4 MB and there are only two ways to use a custom size, either when creating the sparse bundle using the command hdiutil create or in my free utility Spundle. Although those allow you to set a preferred band size, the size actually used may differ slightly from that.

Band size could have significant effects on the performance and stability of sparse bundles, as it determines how many band files are used to store data. Using a size that results in more than about 100,000 band files has in the past (with HFS+ at least) made sparse bundles prone to failure.

To assess the effect of different band sizes on compression rate, I therefore repeated exactly the same test file compression in sparse bundles with 12 band sizes ranging from 2.0 to 1020 MB, as measured rather than the value set.

[Chart: compression time by sparse bundle band size]

In the graph above, the time to complete the compression task is shown for each band size tested. The fastest compression, about 8.3 seconds, came at a band size of 15.4 MB, significantly quicker than the 11.08 seconds at the default of 8.4 MB. Band sizes between 2.0 and 8.4 MB showed little difference, but increasing band size to 33.8 MB or more resulted in much slower compression.

[Chart: compression rate by sparse bundle band size]

When expressed as compression rates, peak performance at a band size of 15.4 MB is even clearer, and approaches 1.9 GB/s, more than twice that of a read-write disk image, but still far below rates in excess of 2.6 GB/s found for the USB4 and internal SSDs. With band sizes above 33.8 MB, compression rates fall below those of a read-write disk image.

Another way of looking at optimum band size is to convert that into the number of band files required. Fastest rates were observed here in sparse bundles that would have about 6,000-7,000 bands in total when the sparse bundle is full, and in this case with 2,000-2,300 band files in use. Performance fell when the maximum number of band files fell below 3,000, or those in use fell below 1,000.
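
The arithmetic behind those counts is straightforward. The sketch below assumes a 100 GB sparse bundle, a size consistent with the band totals quoted above, though that size is my assumption:

```python
import math

def bands_when_full(volume_bytes: int, band_bytes: int) -> int:
    """Number of band files needed when the sparse bundle is full."""
    return math.ceil(volume_bytes / band_bytes)

fastest = bands_when_full(100 * 10**9, 15_400_000)  # about 6,500 bands
default = bands_when_full(100 * 10**9, 8_400_000)   # about 11,900 bands
```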

Streaming

Compression tests impose different demands on storage from conventional read or write benchmarks because file data to be compressed is streamed by reading data from storage, and writing compressed data out to the same storage device. Although source and destination are separate files, allowing simultaneous reading and writing, if those files are stored in common band files, it’s likely that access has to be limited and can’t be simultaneous.

As observed here, that’s likely to result in performance substantially below that expected from single-mode transfer tests. Unfortunately, it’s also extremely difficult to measure attempts to simultaneously read and write from different files in the same storage medium.

Conclusions

  • Sparse bundles (UDSB) remain generally faster in use than read-write (UDRW) disk images.
  • Sparse bundle performance is sensitive to band size.
  • Default 8.4 MB bands aren’t fastest in all circumstances.
  • When sparse bundle performance is critical, it may be worth optimising band file size instead of using the default.
  • Tasks that stream data from and back to the same storage are likely to run more slowly in a sparse bundle than on its host storage.

Previous articles

Dismal write performance of Disk Images
Which disk image format?
Disk Images: Performance

Last Week on My Mac: Compression models

Over the fifty years since I started using computers, one of the greatest changes has been who does the waiting. In my early days, it was me, in some cases waiting overnight for batches to be processed on a mainframe computer. By the mid 1990s I had to wait a couple of days for some of my optimisation runs to complete on the Mac in my office. Since then, as Macs have got progressively faster, they have come to do the waiting for me. Whether I’m editing an image or writing an article, my Mac keeps pace with almost everything I do.

One notable if infrequent exception is when there’s a macOS update. On a newer Apple silicon model, installation itself is far quicker than when Apple changed to using the SSV in Big Sur, but there’s always that wait for the downloaded update to be “prepared” in the background. Much of that is thought to be decompression, which minimises the size of downloads by putting the burden on the Mac to expand and organise what needs to be installed.

Data compression is widely used in modern operating systems. Take a look in Activity Monitor’s Memory view, and at the bottom right you’ll see how much compressed memory is in use, here currently over 1 GB. It’s used to save space in macOS, where many system files are stored in compressed formats. Compression is also an integral part of many media formats, whether still or moving images, or audio. Unless they’re compressed, we’d only be able to enjoy a small fraction of the media we now consume so voraciously.

It’s understandable that compression is thus one of the best-studied tasks in computing. Over decades of intensive research and practical experience, we have learned that the most powerful methods of compressing data can also take the most processing time. The effectiveness of compression is generally expressed in terms of compression ratio, the ratio of original file size to that of the compressed file. Thus all effective compression methods should result in a compression ratio of more than 1. You’ll also come across percentage space saving, the reduction in size compared to the original. If a compression method shrinks a 10 GB file to 5 GB, then its compression ratio is 10/5 = 2, and its space saving is (1 – 5/10) x 100 = 50%.
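
Both measures are trivial to compute; here they are applied to the worked example from the text:

```python
def compression_ratio(original: float, compressed: float) -> float:
    """Ratio of original size to compressed size; >1 for effective compression."""
    return original / compressed

def space_saving(original: float, compressed: float) -> float:
    """Percentage reduction in size compared with the original."""
    return (1 - compressed / original) * 100

ratio = compression_ratio(10, 5)  # 10 GB shrunk to 5 GB: ratio 2.0
saving = space_saving(10, 5)      # space saving 50.0 %
```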

One of the most efficient compression techniques has been that in nanozip 0.09a, for instance, which can achieve a compression ratio of 5.3, or a space saving of 81%, on English text taken from Wikipedia. This illustrates the trade-offs made in such efficient techniques, as to achieve that it takes almost 128 seconds to compress 100 MB, a rate of 0.783 MB/s. That makes it of little everyday use and unsuitable for macOS updates, where it also couldn’t achieve the same high compression ratio because the data are binary rather than English text.

Optimising compression isn’t as simple as it might appear. There’s an inverse relationship between the time spent compressing data and the amount of compressed data to be written, as there is with many similar tasks. The more sophisticated the compression and the higher the compression ratio, the less data there is to be written out, although there’s still the same amount of data to be read. Using the example of a compression ratio of 2, 10 GB of input data has to be read and only 5 GB written out, so unless write speed is less than half read speed, reading the input data is likely to be more rate-limiting.
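
That reasoning can be expressed as a small model which, given a compression ratio and read/write speeds, reports which transfer limits a streamed task. It deliberately ignores CPU time, so it illustrates only the I/O side:

```python
def limiting_side(size_gb: float, ratio: float,
                  read_gbs: float, write_gbs: float) -> str:
    """Which transfer limits streamed compression, ignoring CPU time."""
    read_time = size_gb / read_gbs              # all input must be read
    write_time = (size_gb / ratio) / write_gbs  # only 1/ratio is written
    return "read" if read_time >= write_time else "write"

# At ratio 2, writing only limits when write speed is under half read speed:
faster_disk = limiting_side(10, 2, read_gbs=3.0, write_gbs=2.0)  # "read"
slower_disk = limiting_side(10, 2, read_gbs=3.0, write_gbs=1.0)  # "write"
```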

Compression is also representative of many other types of everyday tasks, as it consists of three sub-tasks:

  1. read data from disk
  2. process it (encode, decode, transcode)
  3. write the transformed data to disk

Unless you use small file sizes, this is usually performed on data streamed from storage and back to storage, rather than on single chunks of data in memory.

To understand what limits performance of any of those everyday tasks, we need to understand how each of its sub-tasks is limited, and how those limitations interact. In some cases, such as preparation of macOS updates, time taken appears to be determined by the processing required in the second step. In the past, Apple has run all that in a single thread to minimise impact on concurrent use of other apps in the foreground. In other tasks, such as encrypting files, stage 2 processing should be fastest, making it most likely that disk access is limiting.

Although we can’t apply conclusions drawn from studying any specific compression method, we can use it as a model to develop tools that we can then apply to other tasks. Over the last week, I’ve published two articles here discussing how to do that.

In the first I asked how we can investigate whether more threads will run a task faster. We know that for my compression task they do, but that may not be true for the tasks you’re most interested in accelerating. In rare cases, an app may provide a control to set how many threads to use, and it’s easy to use that to investigate. It’s more likely that your app has no such control, but one way around that is to run it in a virtual machine, where you can pick the number of virtual cores to host that VM.

Timing the performance of your app with test files of known size should give a performance rate, and that can help identify which of those sub-tasks is probably limiting overall performance. If your app transcodes video, for example, and you time how long it takes to complete all three sub-tasks on a 10 GB video clip, you can work out its transcoding rate in GB/s. If that test clip takes 20 seconds to read that test file, transcode it, and write out the converted video, then its overall transcoding rate is 0.5 GB/s.

The next article considers how you might work out whether running that task on high-speed storage would improve its overall performance. Although it looks unlikely that a transcoding rate of 0.5 GB/s would be altered by running it on a Thunderbolt 3 disk with its 2.3 GB/s write speed, it’s worth comparing performance on the faster internal SSD to check that, and decide whether a faster USB4 SSD might be better.

By the time that you’ve worked through those tests, you should have better insight into which of those three sub-tasks is limiting performance, how you might be able to improve them, and compare current performance in the overall transcoding rate against attempts at improvement. As I wrote last week, the only way to do that is with objective measurements such as that transcoding rate.

Next week I’ll continue this series by considering how you can control the type of CPU core threads run on, and some related issues about sparse bundles and the performance of Time Machine backup storage.

Tune for Performance: Do you really need a big internal SSD?

The most common economy that many make when specifying their next Mac is to opt for just 512 GB internal storage, and save the $/€/£600 or so it would cost to increase that to 2 TB. After all, if you can buy a 2 TB Thunderbolt 5 SSD for around two-thirds of the price, why pay more? And who needs the blistering speed of that expensive internal storage when your apps work perfectly well with a far cheaper SSD? This article considers whether that choice matters in terms of performance.

To look at this, I’m going to use the same compression task that I used in this previous article, my app Cormorant relying on the AppleArchive framework in macOS to do all the heavy lifting. As results are going to differ considerably when using other apps and other tasks, I’d like to make it clear that this can’t reach general conclusions that apply to every task and every app: your mileage will vary. My purpose here is to show how you can work out whether using a slower disk will affect your app’s performance.

Methods

In that previous article, I concluded that compression speed was unlikely to be determined by disk performance, as even with all 10 P cores running compression threads, that rate only reached 2.28 GB/s, around the same as write speed to a good Thunderbolt 3 SSD. For today’s tests, I therefore set Cormorant to use the default number of threads (all 14 on my M4 Pro) so it would run as fast as the CPU cores would allow when compressing my standard 15.517 GB IPSW test file.

I’m fortunate to have a range of SSDs to test, and here use the 2 TB internal SSD of my Mac mini M4 Pro, an external USB4 enclosure (OWC Express 1M2) with a 2 TB Samsung 990 Pro SSD, and an external 2 TB Thunderbolt 3 SSD (OWC Envoy Pro SX). As few are likely to have access to such a range, I included two disk images stored on the internal SSD, a 100 GB sparse bundle, and a 50 GB read-write disk image, to see if they could be used to model external storage. All tests used unencrypted APFS file systems.

The first step with each was to measure its write speed using Stibium. Unlike more popular benchmarking apps for the Mac, Stibium measures the speed across a wide range of file sizes, from 2 MB to 2 GB, providing more insight into performance. After those measurements, those test files were removed, the large test file copied to the volume, and compressed by Cormorant at high QoS with the default number of threads.

Disk write speeds

[Chart: write speed by file size for each type of storage]

Results in each test followed a familiar pattern, with rapidly increasing write speeds to a peak at a file size of about 200 MB, then a steady rate or slow decline up to the 2 GB tested. The read-write disk image was a bit more erratic, though, with high write speeds at 800 and 1000 MB.

At 2 GB file size, write speeds were:

  • 7.69 GB/s for the internal SSD
  • 7.35 GB/s for the sparse bundle
  • 3.61 GB/s for the USB4 SSD
  • 2.26 GB/s for the Thunderbolt 3 SSD
  • 1.35 GB/s for the disk image.

Those are in accord with my many previous measurements of write speeds for those types of storage.

Compression rates

Times to compress the 15.517 GB test file ranked differently:

  • 5.57 s for the internal SSD
  • 5.84 s for the USB4 SSD
  • 9.67 s for the sparse bundle
  • 10.49 s for the Thunderbolt 3 SSD
  • 16.87 s for the disk image.
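
Those times convert directly to rates using the 15.517 GB size of the test file:

```python
size_gb = 15.517
times_s = {
    "internal SSD": 5.57,
    "USB4 SSD": 5.84,
    "sparse bundle": 9.67,
    "Thunderbolt 3 SSD": 10.49,
    "read-write disk image": 16.87,
}
# Compression rate in GB/s: about 2.79, 2.66, 1.60, 1.48 and 0.92 respectively
rates = {name: size_gb / t for name, t in times_s.items()}
```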

When converted to compression rates, sparse bundle results are even more obviously an outlier, as shown in the chart below.

[Chart: compression rate against measured write speed for each type of storage]

There’s a roughly linear relationship between measured write speed and compression rate in the disk image, Thunderbolt 3 SSD, and USB4 SSD, and little difference between the latter and the internal SSD. These suggest that disk write speed becomes the rate-limiting factor for compression when write speed falls below about 3 GB/s, but above that faster disks make little difference to compression rate.

Poor performance of the sparse bundle was a surprise, given how close its write speeds are to those of the host internal SSD. This is probably the result of compression writing a single very large file across its 14 threads; as the sparse bundle stores file data on a large number of band files, their overhead appears to have got the better of it. I will return to look at this in more detail in the near future, as sparse bundles have become popular largely because of their perceived superior performance.

The difference in compression rates between USB4 and Thunderbolt 3 SSDs is also surprisingly large. Of course, a Mac with fewer cores to run compression threads might not show any significant difference: a base M3 chip with 4 P and 4 E cores is unlikely to achieve a compression rate much in excess of 1.5 GB/s on its internal SSD because of its limited cores, so the restricted write speed of a Thunderbolt 3 SSD may not then become the rate-limiting factor.

Conclusions

  • The rate-limiting step in task performance will change according to multiple factors, including the effective use of multiple threads on multiple cores, and disk performance.
  • There’s no simple model you can apply to assess the effects of disk performance, and tests using disk images can be misleading.
  • You can’t predict whether a task will be disk-bound from disk benchmark performance.
  • Even expensive high-performance external SSDs can result in noticeably poor task performance. Maybe that money would be better spent on a larger internal SSD after all.

Tune for Performance: Activity Monitor’s CPU view

Activity Monitor is one of the unsung heroes of macOS. If you understand how to interpret its rich streams of measurements, it’s thoroughly reliable and has few vices. When tuning for performance, it should be your starting point, a first resort before you dig deeper with more specialist tools such as Xcode’s Instruments or powermetrics.

How Activity Monitor works

Like Xcode’s Instruments and powermetrics, the CPU view samples CPU and GPU activity over brief periods of time, displays results for the last sampling period, and updates those every 1-5 seconds. You can change the sampling period used in the Update Frequency section of the View menu, and that should normally be set to Very Often, for a period of 1 second.

This was normally adequate for many purposes with older M-series chips, but thread mobility in M4 chips can be expected to move threads from core to core, and between whole clusters, at a frequency similar to available sampling periods, and that loses detail as to what’s going on in cores and clusters, and can sometimes give the false impression that a single thread is running simultaneously on multiple cores.

However, sampling frequency also determines how much % CPU is taken by Activity Monitor itself. While periods of 0.1 second and less are feasible with powermetrics, in Activity Monitor they would start to affect its results. If you need to see finer details, then you’ll need to use Xcode Instruments or powermetrics instead.

% CPU

Central to the CPU view, and its use in tuning for performance, is what Activity Monitor refers to as % CPU, which it defines as the “percentage of CPU capability that’s being used by processes”. As far as I can tell, this is essentially the same as active residency in powermetrics, and it’s vital to understand its strengths and shortcomings.

Take a CPU core that’s running at 1 GHz. Every second it ‘ticks’ forward one billion times. If an instruction were to take just one clock cycle, then it could execute a billion of those every second. In any given second, that core is likely to spend some time idle and not executing any instructions. If it were to execute half a billion instructions in any given second, and spend the other half of the time idle, then it has an idle residency of 50% and an active residency of 50%, and that would be represented by Activity Monitor as 50% CPU. So a CPU core that’s fully occupied executing instructions, and doesn’t idle at all, has an active residency of 100%.

To arrive at the % CPU figures shown in Activity Monitor, the active residency of all the CPU cores is added together. If your Mac has four P and four E cores and they’re all fully occupied with 100% active residency each, then the total % CPU shown will be 800%.
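The arithmetic above can be sketched in a few lines of Python. This is a toy illustration of how the percentages add up, not how Activity Monitor is implemented:

```python
def percent_cpu(active_fraction):
    """One core's % CPU is simply its active residency over the sampling period."""
    return 100.0 * active_fraction

# A core executing instructions for half of each second shows 50% CPU:
half_loaded = percent_cpu(0.5)

# Four P and four E cores, all fully occupied, total 800%:
total = sum(percent_cpu(1.0) for _ in range(8))
print(half_loaded, total)
```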

There are two situations where this can be misleading if you're not careful. Many Intel CPUs feature Hyper-Threading, where each physical core gains a second, virtual core that can also run at another 100% active residency. In the CPU History window those are shown as even-numbered cores, and in % CPU they double the total percentage. So an 8-core Intel CPU presents a total of 16 virtual cores, and can reach 1600% when running flat out with Hyper-Threading.

coremanintel

cpuendstop

The other situation affects Apple silicon chips, as macOS runs their CPU cores at a wide range of frequencies. Activity Monitor makes no allowance for frequency: a core shown at 100% could be running as slow as 600 MHz in the M1, or as fast as 4,512 MHz in the M4, over seven times as fast. Totalling these percentages also makes no allowance for the different processing capacity of P and E cores.

Thus an M4 chip’s CPU cores could show a total of 400% CPU when all four E cores are running at 1,020 MHz with 100% active residency, or when four of its P cores are running at 4,512 MHz with 100% active residency. Yet the P cores would have an effective throughput of as much as six times that of the E cores. Interpreting % CPU isn’t straightforward, as nowhere does Activity Monitor provide core frequency data.
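To see how much of that six-fold difference frequency alone accounts for, here's a small check using the figures quoted above; the per-clock remainder is my inference, not a measurement:

```python
# Frequencies quoted above for the M4's E and P cores
e_freq_mhz = 1020
p_freq_mhz = 4512

freq_ratio = p_freq_mhz / e_freq_mhz    # frequency alone: about 4.4x
# If effective P:E throughput is up to 6x, the remainder must come from
# the wider P core microarchitecture (inferred, not measured):
per_clock_advantage = 6.0 / freq_ratio  # about 1.36x per clock
print(round(freq_ratio, 2), round(per_clock_advantage, 2))
```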

tuneperf1

In this example from an M4 Pro, the left of each trace shows the final few seconds of four test threads running on the E cores, which took 99 seconds to complete at a frequency of around 1,020 MHz, then in the right exactly the same four test threads completed in 23 seconds on P cores running at nearer 4,000 MHz. Note how lightly loaded the P cores appear, although they’re executing the same code at almost four times the speed.

Threads and more

For most work, you should display all the relevant columns in the CPU view, including Threads and GPU.

tuneperf2

Threads are particularly important if processes are to run on multiple cores simultaneously, as they're fairly self-contained packages of executable code that macOS can allocate to a core to run. Processes that consist of just a single thread may get shuffled around between different cores, but can't run on more than one of them at a time.

Another limitation of Activity Monitor is that it can’t tell you which threads are running on each core, or even which type of core they’re running on. When there are no other substantial threads active, you can usually guess which threads are running where by looking in the CPU History window, but when there are many active threads on both E and P cores, you can’t tell which process owns them.

You can increase your chances of being able to identify which cores are running which threads by minimising all other activity, quitting all other open apps, and waiting until routine background activities such as Spotlight have gone quiet. But that in itself can limit what you can assess using Activity Monitor. Once again, Xcode Instruments and powermetrics can make such distinctions, but can easily swamp you with details.

GPU and co-processors

Percentages given for the GPU appear to be based on a similar concept of active residency without making any allowance for frequency. As less is known about control of GPU cores, that too must be treated carefully.

Currently, Activity Monitor provides no information on the Apple Neural Engine (ANE), or the matrix co-processor (AMX) believed to be present in Apple silicon chips, and what's available from other tools is also severely limited. For example, powermetrics can provide power use for the ANE, but doesn't even recognise the existence of the AMX.

Setting up

In almost all cases, you should set the CPU view to display All processes using the View menu. Occasionally, it’s useful to view processes hierarchically, another setting available there.

Extras

Activity Monitor provides some valuable extras in its toolbar. Select a process and the System diagnostics options tool (with an ellipsis … inside a circle) can sample that process every millisecond to provide a call graph and details of binary images. The same menu offers to run a spindump, for investigating processes that are struggling, and to run sysdiagnose and Spotlight diagnostic collections. The Inspect selected process tool ⓘ provides information on memory, open files and ports, and more.

Explainers

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

multiterms1

When an app is run, a single runtime instance is created from its executable code. This is given its own virtual memory and access to the system resources it needs. This is a process, listed as a separate entity in Activity Monitor.

Each process has a main thread, a single flow of code execution, and may create additional threads, perhaps to run in the background. Threads don't get their own virtual memory, but share that allocated to the process. They do, though, have their own stack. A thread can only run on a single core at a time, although it may be moved between cores, sometimes rapidly. In macOS, threads are assigned a Quality of Service (QoS) that determines how and where macOS runs them. Normally the process's main thread is assigned a high QoS, and background threads that it creates may be given the same or lower QoS, as determined by the developer.

Within each thread are individual tasks, each a quantity of work to be performed. These can be brief sections of code and are more interdependent than threads. They’re often divided into synchronous and asynchronous tasks, depending on whether they need to be run as part of a strict sequence. Because they’re all running within a common thread, they don’t have individual QoS although they can have priorities.

Tune for Performance: do more threads run faster?

One of the most distinctive features about modern Macs is that they have multiple cores, in Apple silicon models a minimum of eight. For an app to be able to make use of more than one core (or its equivalent) at a time it needs to divide its processing into threads, discrete blocks of code that can be run on different cores by macOS. If it doesn’t do that, then running that app on a high-end Pro, Max or Ultra chip is unlikely to be significantly faster than on a base model (assuming that task is CPU-bound).

Activity Monitor

You might be able to get a good idea of how well an app makes use of multiple cores by watching it in Activity Monitor's CPU History window, but in many cases that isn't conclusive, and needs to be confirmed.

To illustrate ways to tackle this, I take the example of file compression. Some methods lend themselves to multiple threads better than others, and some may be implemented in a way that won’t accelerate when several cores are available. For the occasional user this might make little difference, but if you were to spend much of your day waiting for 10-100 GB files to compress, it merits a little exploration.

polycore1

This CPU History window from a compression task run on an M4 Pro is typically unhelpful. A single file compression is seen in the group of 4-5 peaks in CPU in the right half of each trace. Although compression used all ten P cores, it appears to have been moved between P clusters, seen by comparing the timing of peaks on Cores 8 and 10, 9 and 11. At no time does the task appear to exceed 50% active residency on any of the cores, though. All you can do is guess as to what might be going on, and whether it might run faster on twelve or more P cores.

polycore2

On another run, the same task is reported to have reached a peak of 500% CPU in 12 threads, but would it exceed that on more P cores, given that 10 were available?

Timing performance

As usual, I’ve been cheating a little to generate those results, by using my simple compression-decompression app Cormorant, in which I control how many threads it uses. If the app you’re trying to investigate offers a similar feature, then you can set up a standard task, here the compression of a 15.517 GB IPSW file, and time how long it takes using different numbers of threads.

polycore3

The answers from Cormorant, which conveniently performs its own timing, are:

  • 1 thread takes 49.32 seconds
  • 2 take 26.74 s
  • 3 take 18.60 s
  • 4 take 14.29 s
  • 5 take 11.21 s.

So Cormorant’s compression can make good use of more P cores, although with diminishing returns.
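Those timings can be turned into speedup and parallel efficiency with a few lines of Python, using the Cormorant figures above:

```python
# Compression times in seconds from the Cormorant tests above
times = {1: 49.32, 2: 26.74, 3: 18.60, 4: 14.29, 5: 11.21}

for threads, secs in sorted(times.items()):
    speedup = times[1] / secs        # how much faster than one thread
    efficiency = speedup / threads   # how much of each extra thread is realised
    print(f"{threads} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficient")
```

Five threads deliver about a 4.4x speedup, so each thread is still doing almost 90% of the work a lone thread would, which is what "good use with diminishing returns" looks like in numbers.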

Very few apps give you this level of control, though. The only other compression utility that appears to is Keka.

polycore4

In its settings, you can give its tasks a maximum number of threads, and even run them at custom Quality of Service (QoS) if you want them to be run in the background on E cores, and not interrupt your work on P cores.

Controlling threads

There is one way to limit the effective number of threads used by an arbitrary app: run it in a Virtual Machine, as you control the number of virtual cores that the VM uses. While you can just about run a macOS VM on a single core, I suggest that a more workable starting point is two cores, and you can increase that number up to the total of P cores in the host without causing problems.

VMs have other virtues, including their relative lack of background processes, allowing their virtual cores to be almost entirely devoted to your test task. However, as they can’t run apps from the App Store other than Apple’s free suite of Pages, Numbers and Keynote, that could prevent you from using them for testing.

To set up a VM for thread tests, I duplicated a standard Sonoma 14.7.1 VM so I could throw it away at the end, opened it with five virtual cores, and copied over the test app Cormorant and file. I then closed that VM down, set it to use 2 virtual cores, opened it and ran my test. I repeated that with an increasing number of virtual cores up to a total of 5. Compression times are:

  • 2 vCPUs take 30.13 seconds
  • 3 take 21.36 s
  • 4 take 17.18 s
  • 5 take 13.59 s.

Those are only slightly slower than their equivalents from the host tests above.

Analysis

polycore5

Plotting those results out using DataGraph, the lines of best fit follow power laws:

  • for real cores, time = 49.8863/(T^0.91169)
  • for virtual cores, time = 54.737/(T^0.854318)

where T is the number of threads. That explains the apparently diminishing returns with increasing numbers of threads, although the maths isn’t as simple as we might like.

A better way to look at this is by calculating the rate of compression in GB/s, simply by dividing the file size of 15.517 GB by each time. Here we end up with straight lines from linear regression, that are more amenable to thought.

polycore6

Those regressions are:

  • for real cores, rate of compression = 0.0464 + (0.264 x T)
  • for virtual cores, rate of compression = 0.102 + (0.206 x T)

which are more generally useful.
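These regressions are easy to play with in Python, for instance to predict the compression rate at higher thread counts, or to compare VM against host speed. Coefficients are as quoted above:

```python
size_gb = 15.517  # the test file compressed above

def rate_host(threads):
    """Compression rate in GB/s on real P cores (regression above)."""
    return 0.0464 + 0.264 * threads

def rate_vm(threads):
    """Compression rate in GB/s on virtual cores (regression above)."""
    return 0.102 + 0.206 * threads

# The slope ratio gives the relative speed of VM to host compression:
vm_fraction = 0.206 / 0.264    # about 0.78
# Extrapolating the host regression to all ten P cores:
predicted_10 = rate_host(10)   # about 2.69 GB/s
print(round(vm_fraction, 2), round(predicted_10, 2))
```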

Tuning performance

This might appear over-elaborate and of little practical use, but we now have a much better understanding of the factors limiting compression performance:

  • The more threads compression uses, the shorter time a task will take.
  • There are limits to that improvement, though, when substantially more than 5 threads are used.
  • Compression rates achieved in M4 P cores are significantly lower than read or write speeds of the internal SSD, so aren’t likely to be limited by the speed of a faster SSD (TB3 or USB4).
  • As compression appears to be CPU-bound, faster P cores would also be expected to result in shorter times.
  • Improving the efficiency of the compression code could increase performance.
  • Compression in a VM runs at about 78% of the speed on the host.

To see how reliable these are, I therefore repeated the Cormorant test using all 10 P cores on the host Mac, which took 6.80 seconds, a little more than half the time for 5 cores. That’s a compression rate of 2.28 GB/s, rather less than the 2.69 GB/s predicted by the linear regression. That’s now approaching the write speed of some TB3 SSDs, and the rate-limiting step could then change to be the write performance of the storage being used, rather than CPU cores.

How Thunderbolt 5 can be faster or not

Early reports of Thunderbolt 5 performance have been mixed. This article starts from Thunderbolt 3 and dives deeper into the TB5 specification to explain why it might not deliver the performance you expect.

Origins

Thunderbolt 3 claims to deliver up to 40 Gb/s transfer speeds, but can’t actually deliver that to a peripheral device. That’s because TB3 includes both 4 lanes of PCIe data at 32.4 Gb/s and 4 lanes of DisplayPort 1.4 at up to 32.4 Gb/s, coming to a total of nearly 65 Gb/s. But that’s constrained within its maximum of 40 Gb/s.
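The arithmetic is simple but worth making explicit, using the figures from the paragraph above:

```python
# Bandwidths in Gb/s, as quoted above
pcie = 32.4   # 4 lanes of PCIe data
dp = 32.4     # 4 lanes of DisplayPort 1.4
link = 40.0   # the Thunderbolt 3 ceiling they must share

demand = pcie + dp               # 64.8 Gb/s of potential traffic
oversubscribed = demand - link   # 24.8 Gb/s that can never fit down the link
print(round(demand, 1), round(oversubscribed, 1))
```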

The end result is that the best you can expect from a Thunderbolt 3 SSD is a read/write speed of around 3 GB/s. Although that would be sufficient for many, Macs don’t come with arrays of half a dozen Thunderbolt 3 ports. If you need to connect external displays, the only practical solution might be to feed them through a Thunderbolt 4 dock or hub. As that’s fed by a single port on the Mac, its total capacity is still limited to 40 Gb/s, and that connection becomes the bottleneck.

Thunderbolt 5 isn’t a direct descendant of Thunderbolt 3 or 4, but is aligned with the second version of USB4. This might appear strange, but USB4 in its original version includes support for Thunderbolt 3. What most obviously changes with USB4 2.0 is its maximum transfer rate has doubled to 80 Gb/s. But even that’s not straightforward.

Architecture

thunderbolt5a

The basic architecture of a Thunderbolt 5 or USB4 2.0 connection is shown above. It consists of two lanes, each of which has two transmitter-receiver pairs operating in one direction at a time and each transferring data at up to 40 Gb/s. This provides a simultaneous total transfer rate of 80 Gb/s in each direction.

These lanes and transmitter-receiver pairs can be operated in several modes, including three for Thunderbolt 5 and USB4 2.0.

thunderbolt5b

Single-lane USB4 is the same as the original USB4 already supported by all Apple silicon Macs, and in OWC’s superb Express 1M2 USB4 enclosure, and in practice its full 40 Gb/s comfortably outperforms Thunderbolt 3 for data transfers.

thunderbolt5c

The first of the new high-speed Thunderbolt 5 and USB4 2.0 modes is known as Symmetric USB4, with both lanes bonded together to provide a total of 80 Gb/s in each direction. This is the mode that an external TB5 SSD operates in, to achieve claimed transfer rates of ‘up to’ 6 GB/s.

thunderbolt5d

The other new mode is Asymmetric USB4. To achieve this, one of the lanes has one of its transmitter-receiver pairs reversed. This provides a total of 120 Gb/s in one direction, and 40 Gb/s in the other, and is referred to in Thunderbolt 5 as Bandwidth Boost. This can be used upstream, from the peripheral to the Mac host, or more commonly downstream, where it could provide sufficient bandwidth to support high-res displays, for example.
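The three modes can be summarised as (downstream, upstream) bandwidth totals. This sketch assumes each transmitter-receiver pair carries 40 Gb/s, as described above:

```python
PAIR_GBPS = 40  # one transmitter-receiver pair

# (downstream, upstream) totals in Gb/s for the three modes
modes = {
    "single-lane USB4": (PAIR_GBPS, PAIR_GBPS),           # 40/40
    "Symmetric USB4":   (2 * PAIR_GBPS, 2 * PAIR_GBPS),   # 80/80
    "Asymmetric USB4":  (3 * PAIR_GBPS, PAIR_GBPS),       # 120/40
}
print(modes["Asymmetric USB4"])
```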

Direct host connections

In practice, these modes should work transparently and to your advantage when connecting a peripheral direct to your Mac. If it’s a display, then the connection can switch to Asymmetric USB4 with its three transmitters in the host Mac, to deliver 120 Gb/s to that display. If it’s for storage or another device moving data in both directions, then Symmetric USB4 is good for 80 Gb/s in each direction, and should deliver those promised 6 GB/s read/write speeds.

Docks and hubs

When it comes to docks and hubs, though, there’s the potential for disappointment. Connect three high-res displays to your dock, and you want them to benefit from Asymmetric USB4 coming downstream from the host Mac. If the dock correctly switches to that mode for its connections to the displays, then it won’t work properly (if at all) if the connection between the dock and host is Symmetric USB4, as that will act as an 80 Gb/s bottleneck for the dock’s downstream 120 Gb/s.

Using a Thunderbolt 5 dock or hub is thus a great enabler, as it takes just one port on your Mac to feed up to three demanding peripherals, but it requires careful coordination of modes, and even then could fall short of your expectations.

Consider a TB5 Mac with a dock connected to one large high-res display, and a TB5 SSD. If modes are coordinated correctly, the Mac and dock will connect using Asymmetric mode to deliver 120 Gb/s downstream to the dock, then Asymmetric again from the dock to the display. But that leaves the SSD with what’s left over from that downstream bandwidth, although it still has 40 Gb/s on the return from the SSD through the dock to the Mac. That TB5 drive is then likely to perform as if it was an old USB4 drive, with perhaps half its normal read/write speeds. Of course, that’s still better than you’d get from a TB4 dock, but not what you paid for.

There’s also the potential for bugs and errors. I wonder if reported problems in getting three 6K displays working through a TB5 hub might come down to a failure to connect from Mac to dock in Asymmetric mode. What if the Mac and dock agree to operate in Asymmetric mode when the sole connected display doesn’t require that bandwidth, thus preventing an SSD from achieving an acceptable read speed?

Who needs TB5?

There will always be those who work with huge amounts of data and need as much speed as they can get. But for many, the most important use for Thunderbolt 5 is in the connection between Mac and dock or hub, as that’s the bottleneck that limits everyday performance. MacBook Pro and Mac mini models that now support TB5 come with three Thunderbolt 5 ports and one HDMI display port. You don’t have to indulge in excess to fill those up: add just one Studio Display and external storage for backups, and that Mac is down to its last Thunderbolt port.

As a result, Thunderbolt docks and hubs have proved popular, despite their bottleneck connection to the Mac. For many, that will be where Thunderbolt 5 proves its worth, provided it can get its modes straight and deliver the better performance we’re paying for.

Last Week on My Mac: Tuning for performance

Perhaps the greatest subversion of EVs is their threat to car culture. Over more than a century, modifying and tinkering with internal combustion engines has become a popular obsession, and grown a huge industry devoted to tuning for performance. Replacing those lovingly crafted vehicles with battery-powered appliances seems too much to bear for all those enthusiasts.

Half the Mac articles posted here last week are about improving performance, whether in the CPU cores of the M4 or when connecting to external displays and storage. In comments we compared predictions, benchmarks and experience in our quest for improvement. Most telling, though, are those who report little improvement in the software they use to earn their livelihood, products that have been central to Macs for decades, such as Adobe Photoshop since 19 February 1990. If all that engineering effort has had so little effect, there’s something seriously amiss.

There’s much to be transferred from our experience of tuning vehicles to improving the performance of our apps. First is the appreciation that benchtests don’t necessarily translate into what happens on the road. Second is the need for objective measurements to assess performance relevant to our aims. Third is use of a systematic approach to improvement, in recognising where bottlenecks or constraints are, and addressing them methodically.

Benchtests

When each new Apple silicon chip becomes accessible, there’s a race to post its first Geekbench results and thereby demonstrate how performant it is. YouTube is now filling up with demonstrations of impressive or disappointing figures shown on the dials of Blackmagic speed tests on Thunderbolt 5 devices. It’s significant here that the Blackmagic test displays analogue meters taken from those on traditional car dashboards.

Few ever drill down and ask what those numbers returned by Geekbench mean, nor the relevance of disk speed tests to situations where transfer to or from storage limits app performance. Although Primate Labs provide details of the compendium of tests used to calculate Geekbench scores, it’s impossible to know how those might compare with code run by the apps we use. A Mac with impressive benchmark scores can still be dog-slow when running our daily tasks.

Blackmagic Disk Speed Test appears to report write and read speeds measured for a single file size of 5 GB, but tells you little about how storage might cope with large numbers of smaller files, or those much larger. As results vary between tests, it needs some method of providing the best estimate for a series of results, and a measure of the confidence in that number. As a quick, fun method of checking whether storage is up to the task of recording and playing back video files, it’s ideal, but it’s neither intended nor suitable for more general purposes.

Objective measurements

Subjective assessments are widely known for their power to mislead. If you ever want the thrill of your life, ride a recumbent trike at anything over 30 miles an hour down a steep and winding hill. Because your bum and eyes are so much closer to the road surface it feels like three times that speed in a car. You can see similar effects in non-linear progress bars. When they’re slow to start with and accelerate rapidly from midway on, they appear quicker than the reverse, with an apparently interminable wait for the last ten percent to complete.

My heart sinks when someone tells me that something is slow without being specific about what that something is, and providing numbers to support that impression. The only objective assessments are quantitative, in numbers rather than feelings.

But those numbers must also measure something meaningful. In the past we’ve tried simply timing how long a Mac takes to start up, or to launch an app, as if that’s what we spend all day waiting for. When you only launch an app once a day and restart your Mac every couple of weeks, who cares whether those take a few seconds more or less?

Ideally, you need to identify tasks that you have to wait for repeatedly, with discrete instants at the start and end that are separated by several seconds at least. Those can then be timed fairly accurately and reproducibly to form your own task-specific performance benchmark. Then when you come to the stage of testing out the effects of different settings and hardware, you can compare those times. This could become a bit more technical; when assessing the performance of the macOS Unified log, for instance, I’ve calculated the time in nanoseconds between log entries, but that level of precision is rarely needed.
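If you want to build such a task-specific benchmark, a minimal timing harness needs little more than a monotonic clock and a median. Here's one possible sketch in Python; the function name and structure are my own, not from any macOS tool:

```python
import statistics
import time

def time_task(task, runs=5):
    """Run a task repeatedly and return the median wall-clock time in seconds.
    The median resists the odd run disturbed by background activity."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        task()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Example: a do-nothing sleep stands in for your real repeatable task
elapsed = time_task(lambda: time.sleep(0.01), runs=3)
```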

Identification

Armed with comparisons of the time to perform relevant key tasks, we must then analyse what’s involved in each and determine which step is rate-limiting. Most tasks involve several stages, perhaps starting with reading of data from storage into memory, then processing that and displaying an outcome. Those in turn depend on effective read speed, memory capacity and access time, and a host of operations in CPU cores, GPU and possibly other units in the chip.

The key to understanding performance is Activity Monitor, in its different views. Developers also use Xcode's valuable collection of Instruments, Signposts are widely used in the log, and if you really want to home in on what's happening in processor cores there's always powermetrics. But well before you risk getting lost in those weeds, observations in Activity Monitor should provide a good picture of what's going on, and where delays are occurring.

Activity Monitor does have its pitfalls, as with any other method of investigation. Perhaps its greatest shortcoming on Apple silicon Macs is in not displaying or taking into account frequencies of the two types of CPU core, but at least its CPU History window does show which type of core is bearing the brunt.

Goal

Behind all this is the need for a more critical approach when tuning our Macs for better performance. Fitting a new exhaust system to a car might make it sound good, but unless you go to the trouble of tuning it for performance, it might actually be slower on the road, all bark and no bite. Unlike Apple’s devices, our Macs haven’t become appliances, and in coming articles I’ll explore how you can tune yours.

Inside M4 chips: CPU core management

Whether you’re a developer or user, gaining an understanding of how macOS manages the cores in an Apple silicon CPU is important. It explains what you will see in action when you open Activity Monitor, how your apps get to deliver optimal performance, and why you can’t speed up background tasks like Time Machine backups. In this series (links below) I’ve been trying to piece this together for the M4 family, and this article is my first attempt to summarise as much of the story as I know so far. I therefore welcome your comments, counter-arguments and improvements.

Scope

For the purposes of this article, I’ll consider a single thread that macOS is ready to load onto a CPU core for execution. For that to happen, five decisions are to be made:

  • which type of core, P or E,
  • which cluster to run it in,
  • which core within that cluster,
  • what frequency to run that cluster at,
  • the mobility of that thread between cores in the same cluster, and between clusters (when available).

Which type of core?

Since the early days of analysing M1 CPUs, it has been clear that the choice between P and E core types is made on the Quality of Service (QoS) assigned to the thread, availability of a core of that type, and whether the thread uses a co-processor such as the AMX. For the sake of generality and simplicity, I’ll here ignore the last of those, and consider only threads that are executed by the CPU alone.

QoS is primarily set by the process owning the thread, in a setting exposed to the programmer and user, although internally QoS is modulated by other factors including the thermal environment. Threads assigned a QoS of 9 or less, designated Background, are allocated exclusively to E cores, while those with higher, 'user' levels of QoS are preferentially allocated to P cores. When those are unavailable, they may be allocated to E cores instead.

This can be changed on the fly, reassigning higher QoS threads to run on E cores, but that’s not currently possible the other way around, so low QoS threads can’t be run on P cores. The sole exception to that is when run inside a virtual machine, when VM virtual cores are given high QoS, allowing low QoS threads within the VM to benefit from the speed of P cores.
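Those rules can be condensed into a short sketch. This is my simplification of the behaviour described above, not any macOS API:

```python
def core_type(qos, p_core_free, in_vm=False):
    """My simplification of the allocation rule described above, not an API.
    QoS of 9 or less is Background; higher values are 'user' levels."""
    if in_vm:
        # VM virtual cores run at high QoS, so even low QoS guest
        # threads can reach P cores
        return "P" if p_core_free else "E"
    if qos <= 9:
        return "E"  # Background threads are pinned to E cores
    return "P" if p_core_free else "E"

print(core_type(9, p_core_free=True),     # E: Background stays on E cores
      core_type(25, p_core_free=True),    # P: user QoS prefers P cores
      core_type(25, p_core_free=False),   # E: spills over when P cores are busy
      core_type(5, p_core_free=True, in_vm=True))  # P: the VM exception
```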

Which cluster?

M4 Pro and Max variants have two clusters of P cores, so the next decision is which of those to run a higher QoS thread on:

  • if both clusters are shut down, one will be chosen and its frequency brought up;
  • if one cluster is already running and has sufficient idle residency (‘available residency’) to accommodate the thread, that will be chosen;
  • if one cluster already has full active residency, the other cluster will be chosen;
  • if both clusters already have full active residency, then the thread will be allocated to the E cluster instead.

This fills an active P cluster before allocating threads to the inactive one, and fills both P clusters before allocating a higher QoS thread to the E cluster.
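The same decision sequence can be expressed as a sketch in Python. The 500% cluster capacity assumes the five-core P clusters of this M4 Pro, and the function is my simplification, not macOS code:

```python
P_CLUSTER_CAP = 500  # five P cores x 100% each, as on this M4 Pro

def choose_cluster(residencies):
    """residencies: total active residency (%) of each P cluster.
    Returns the index of the chosen P cluster, or 'E' when all are full."""
    # Prefer a cluster that's already running and has available residency
    for i, r in enumerate(residencies):
        if 0 < r < P_CLUSTER_CAP:
            return i
    # Otherwise bring up a shut-down cluster
    for i, r in enumerate(residencies):
        if r == 0:
            return i
    # All P clusters full: spill over to the E cluster
    return "E"

print(choose_cluster([300, 0]),    # 0: join the running cluster
      choose_cluster([500, 0]),    # 1: first cluster full, start the other
      choose_cluster([500, 500]))  # E: both full
```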

m4coremanagement1

Cluster allocation is of course simpler on the base M4, where there's only one E and one P cluster. Should Apple make an M4 Ultra as expected, that would have not only four P clusters but two E clusters. Until that happens, we can only speculate that similar rules would then apply, including to the E clusters.

Which core within the cluster?

This is perhaps the simplest decision to make. If there’s only one core with available residency in that cluster, that’s the only choice. Otherwise macOS picks an arbitrary core from those available, apparently to ensure roughly even use of cores within each cluster.

What frequency?

For the E cluster, choice of frequency appears straightforward: low QoS threads are run at the minimum E core frequency of 1,020 MHz or slightly higher, while higher QoS threads spilt over from fully occupied P clusters are run at the E core maximum of 2,592 MHz.

P cluster frequency appears to be determined by the total active residency of that cluster after the new thread has been added. When that’s the only thread running in that cluster, maximum frequency of 4,512 MHz is chosen, but rising total active residency reduces that in steps down to about 3,852 MHz when all the cores in the cluster are at 100% active residency. In most cases, the big reduction in frequency occurs when going from about 200% to 300% total active residency. This currently appears to be part of a strategy to pre-emptively minimise the risk of thermal stress within the chip.

m4coremanagement2

Thread mobility

Once the thread has been loaded into a core in the optimal cluster running at the chosen frequency, it’s likely to be moved periodically, both to any other free core within that cluster, and to another cluster, when available. While this does occur in previous M-series CPUs, it appears particularly prominent in M4 variants.

Movement of threads within a cluster can occur quite frequently, every 0.1 second or so, particularly within the E cluster. Movement between clusters occurs less frequently, about every 4-5 seconds, and would only occur when the other cluster is shut down or idle, so free to run all the threads of the current cluster. This is most probably to ensure even thermal conditions within the chip.

Summary

The whole strategy is shown in the following diagram, also available as a tear-out PDF from here: m4coremanagement1

m4coremanagement3

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Inside M4 chips: Matrix processing and Power Modes
Inside M4 chips: Controlling frequency

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Inside M4 chips: Controlling frequency

To realise best performance and energy efficiency from the big.LITTLE architecture in Apple’s M-series chips requires careful management on the part of macOS. There’s much more to it than balancing loads over conventional multi-core CPUs with a single type of core, as each execution thread needs to be run in an optimal location. When deciding where to run a CPU thread, macOS controls:

  • which type of core, P or E, primarily determined by the thread’s Quality of Service (QoS), and core availability;
  • which cluster to run it in, for chips with more than one cluster of that type, set to try to keep as few clusters active as necessary;
  • which core within that cluster, determined by core availability, and semi-randomised to even out core use;
  • what frequency to run that cluster at, in turn depending on the core type and the thread’s QoS;
  • mobility of that thread between cores in the same cluster, and between clusters (when available).

Over the last four years, I have explored the rules apparently used for the first two, and the choice of frequency in E cores. This article looks in more detail at how the frequency of P clusters appears to be determined in M1, M3 and particularly M4 chips.

powermetrics provides more frequency figures than you know what to do with, although most are derived and to some extent imaginary, making reconciliation difficult. In tests reported in the previous article, I used those given as Cluster HW active frequency to demonstrate distinctive patterns seen on M4 P cores running different numbers of test threads.

m4frequenciesByThreads

This graph shows those frequencies for the active P cluster by the number of threads, for floating point, NEON and vDSP_mmul tests detailed previously. Frequencies for the first two tests are identical, at P core maximum for a single thread, then falling sharply from 2 to 3 threads. When more threads are run, a cluster that’s fully active is run at the same frequency as that for 5 threads (P cluster size on this M4 Pro), while the other P cluster follows the same frequencies shown in the graph for the number of threads it’s running.

To examine this further I first climbed a mountain.

Climbing a mountain

For this test I used three copies of my test app to run a total of three identical threads of my in-core floating point test code in a mountain pattern. I first started powermetrics gathering data, then launched the first thread, followed by the second, and then the third. My objective was to observe an initial period when just one test thread was running, a second with the second test thread in addition, a third when all three threads would be running, and then watch the sequence reverse as each thread ended. This is shown in the results below.

m43appflopt1

This chart shows active residencies by core and cluster for the P cores in an M4 Pro, with 5 P cores in each of its two P clusters, during this test. For the first 15 sample periods (1.5 seconds), a single test thread is moved around between cores in the second P cluster (P1). That’s joined by the second thread run on another core in the same cluster, until sample 30, when the third thread is added, pushing the total active residency to 300%.

At that point, all three threads are moved to three cores in the first P cluster (P0), whose bars are shown in blues and green. The first thread completes in sample 37, leaving two threads with 200% active residency to continue in that cluster until sample 50, when the second thread completes, leaving just one running. In sample 54 (5.4 seconds after test start), that one remaining thread is moved back to complete on core P11 in the second cluster late in sample 63.

In that period of 6.3 seconds, each of the two P clusters has run 1, 2 and 3 threads.

m43appflopt2

This graph shows cluster frequencies over the same period, this time given in seconds elapsed rather than sample number. The red line and points show the frequency of cluster P1, and blue for P0. Those undergo step changes when each cluster is running test threads. The inactive cluster is normally shut down with a frequency of 0 MHz, although there are some brief spikes from that as well.

m43appflopt3

Combining active residency bars in yellow with core frequency lines, it’s clear that cluster frequency is close to core maximum at 4,500 MHz when only a single thread is running. With two threads, it’s reduced to 4,400 MHz, and down to 3,900 MHz when all three threads are running. Those changes are symmetrical for loading and unloading clusters, and show no signs of hysteresis (different values during loading and unloading).

Closer examination gives frequencies of 4,511 MHz at 100% active residency, 4,415 MHz at 200%, and 3,924 MHz at 300%. The latter is 87% of maximum frequency, a large enough reduction to be reflected in performance. Essentially identical figures are found for the NEON tests as well as these for floating point.

Although this test method can give highly reproducible results, the floating point and NEON tests used don’t resemble threads seen in everyday use. The next step is to extend that by looking at thread numbers and frequency when running more normal code.

Compressing a file

Fortunately, I have already built a suitable platform for real-world testing in a one-trick pony named Cormorant, a basic compression-decompression utility using Apple Archive. Although not a patch on serious apps like Keka, Cormorant can set the number and QoS of threads to be run during compression/decompression. Because it relies on Apple’s framework, it actually runs more than just the threads set in its controls, but still provides a way to control active residency. I therefore ran a test compression (a 15.5 GB IPSW image file) at maximum QoS, to ensure its threads were dispatched to P cores, using 1-3 threads.

Time taken to compress the test file changes greatly according to the number of threads used:

  • 1 thread takes 49.6 s, at 313 MB/s;
  • 2 threads take 26.8 s, at 578 MB/s, 185% of the throughput of the single thread;
  • 3 threads take 18.7 s, at 829 MB/s, 265% of the single thread.

These appear to follow the pattern of frequencies observed on my in-core tests.

m4corm1thread1

This chart shows the opening 3 seconds of single-thread compression, with cluster frequency in the points and line, and total active residency multiplied by 10 (to scale to a common y axis) in pale blue bars. Two significant periods are shown: in samples 12-21, active residency is high, between 300-430%, and frequency is lower at around 4,000 MHz. Following that, active residency falls to about 200% and frequency rises to 4,200 MHz.

Because active residency was so variable in these tests, I pooled paired values for that and cluster frequency, gathered over 3 second periods, and plotted those.

m4corm1thread2

Although at active residencies below about 180% there’s a wide scatter, above that there’s a good linear regression, showing steady decline in frequency over active residencies ranging from 180% to 450%.

The following two graphs show equivalents for tests using 2 and 3 threads. The first of those has two outliers at total active residencies above 490%, corresponding to unusual conditions during the test. I have therefore excluded those from subsequent analyses.

m4corm2thread1

m4corm3thread1

The last step is to pool paired results from all three test conditions, and arrive at a line of best fit.

m4corm1-3Poolthread1

Between total cluster residencies of 150-500%, this works best with a quadratic curve with the equation
F = 4630.05 – (2.0899 x R) + (0.0010107 x R^2)
where F is the predicted cluster frequency in MHz, and R is the total active residency in %.
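The fitted equation can be evaluated directly to check it against the step measurements reported earlier; a quick sketch in Swift, using the coefficients above:

```swift
// The fitted curve above, as a function: frequency F in MHz predicted
// from total cluster active residency R in %.
func predictedFrequency(residency R: Double) -> Double {
    return 4630.05 - (2.0899 * R) + (0.0010107 * R * R)
}

let f1 = predictedFrequency(residency: 100)   // ≈ 4,431 MHz, one fully active core
let f3 = predictedFrequency(residency: 300)   // ≈ 4,094 MHz
let f5 = predictedFrequency(residency: 500)   // ≈ 3,838 MHz, all five cores busy
```

Those predictions sit slightly below the in-core test figures at 100%, as expected from the scatter in the pooled data.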

My other real-world test makes use of the fact that, when virtualising macOS, the number of virtual cores on the host is specified.

Hosting virtual cores

Although virtualisation relies on frameworks run on the host, experience is that its demand on host P cores is constrained to the number of virtual cores allocated, with each of those resulting in 100% active residency, equating to the whole of a P core on the host. powermetrics started collecting sample periods immediately before a macOS 14 VM was launched, and the first 3 seconds (30 samples) were collected and analysed for VMs with 1-3 virtual cores.

m4vm1thread1

This shows the first of those, a VM allocated just a single virtual core, with cluster frequencies shown as red and blue lines, and total active residency multiplied by 10 in the pale blue bars. With a steady total active residency of 100%, active cluster frequency was about 4,500 MHz. Note that sample 7 included transfer of the threads from P1 to P0 in a sharp peak to a total active residency of over 500%.

Average frequencies can thus be calculated for each of the three tests, at 100-300% active residencies.

Set frequencies

I now have estimates of cluster frequencies for cluster total active residencies from:

  • in-core tests using floating point
  • in-core tests using NEON
  • compression
  • virtualisation

against which I compare a matrix multiplication test that may be run on shared matrix co-processors (AMX). These are shown in the table below.

m4pclusterfreq

Running a single thread in a cluster should result in a total active residency of 100%, for which macOS sets the cluster frequency at P core maximum, of 4,400-4,511 MHz. That for 200% is lower, at between 4,000-4,400 MHz, and falls off further to about 3,800 MHz when all 5 cores are at 100% active residency. Frequencies set for the vDSP_mmul test are significantly lower throughout, supporting the proposal that it isn’t run conventionally in P cores, but in a co-processor.

A sixth thread would then be loaded onto the other P cluster, where cluster frequency would be set at P core maximum again, progressively reducing with additional threads until that cluster was also running at about 3,800 MHz.
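The stepped policy described above can be caricatured as a simple lookup. This sketch uses the frequencies measured in the floating point tests; the exact values vary slightly between tests, and the 4-thread figure is an assumption interpolated from the 3- and 5-thread measurements:

```swift
// Caricature of the apparent M4 P-cluster frequency policy (MHz), keyed by
// the number of ~100%-active threads in one cluster. Values are from the
// floating point measurements above; 4 threads is assumed close to 5.
func m4PClusterFrequency(threadsInCluster n: Int) -> Int {
    switch n {
    case ...1: return 4511   // single thread: P core maximum
    case 2:    return 4415
    case 3:    return 3924   // the big step down, 87% of maximum
    default:   return 3852   // 4-5 threads, all cores near 100% active
    }
}
```

A sixth thread starts the same sequence again on the other P cluster, per the paragraph above.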

Following this, I returned to the tests I have performed over the last four years on M1 and M3 P cores. Although I haven’t analysed those formally, I now believe that their frequencies are controlled by macOS as follows:

  • M1 1 core at 3,228 MHz, 2 cores 3,132 MHz, 3-6 cores 3,036 MHz.
  • M3 1 core at 3,624 MHz (below maximum of 4,056), 2-6 cores 3,576 MHz.

The range of frequencies in the M1 and M3 is narrower, resulting in less difference in performance between single- and multi-core tests. However, the M4 falls to 87% of maximum frequency at 3 threads and more, which is substantial. It’s worth noting that Geekbench single-core results for the M4 are around 3,892, which would scale up to a multi-core result of 38,920 on an M4 Pro with 10 P cores, whereas the actual multi-core score is about 22,700, 58% of the scaled value. Although the effects of lower frequency can’t account for all that difference, they must surely contribute to it.
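That scaling arithmetic is easily checked:

```swift
// If multi-core performance scaled linearly with P core count, the
// single-core score would simply be multiplied by the number of P cores.
let singleCore = 3892.0
let scaled = singleCore * 10              // 38,920 for 10 P cores
let actualMultiCore = 22706.0             // measured M4 Pro multi-core score
let fraction = actualMultiCore / scaled   // ≈ 0.58 of perfect scaling
```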

Why?

Two plausible contenders for the reason that macOS reduces P cluster frequency with increasing active residency are for thermal management, hence reliability, and when competing for a limited resource, perhaps the L2 cache shared within each cluster.

The reductions in cluster frequency seen here aren’t thermal throttling, though. Tests were intentionally kept brief, to keep their results within reasonably short series of powermetrics samples. Power use was highest in the NEON and vDSP_mmul tests, and lowest in floating point, although there don’t appear to be matching differences in frequency control. As noted in the previous tests, High Power mode didn’t alter frequency control, although frequencies were reduced in Low Power mode.

It’s most likely that this frequency regulation is pre-emptive, based not just on the CPU cores, but also allowing for likely heat output in the rest of the Mac.

Key information

  • When running on Apple silicon Macs, macOS modulates ‘cluster HW active frequency’ of P cores, limiting frequency to below maximum when cluster total active residency exceeds 100%.
  • Although small in M1 variants, this is most prominent in M4 variants, where a total active residency of 300% may reduce cluster frequency to 87% of maximum.
  • Frequency limitation is most probably part of a pre-emptive strategy in thermal management.
  • Frequency limitation is at least partly responsible for non-linear changes in performance with increasing recruitment of P cores, as illustrated in single- and multi-core benchmarks.
  • Control of P cores by macOS is complex, particularly in M4 variants.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Inside M4 chips: Matrix processing and Power Modes

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Acknowledgements

Several of you have contributed to discussions here, but Maynard Handley has for several years provided sage advice, challenging discussion, and his personal mine of information. Thank you all.

Inside M4 chips: Matrix processing and Power Modes

For much of the last four years, CPU and GPU core performance have been of primary importance to those using Apple silicon Macs. Despite that, special interest has developed in those cores that Apple doesn’t talk about, in its matrix co-processor, AMX. Since the first M1 it has been believed that each CPU core cluster has its own AMX, and more recently it has been shown to be capable of impressive performance. With the increasing prominence of computationally intensive features including AI, this is growing in importance.

When the M4 first became available in iPads earlier this year, researchers concluded its chip has a new AMX co-processor that can now be programmed using SME matrix extensions supported by the new version of the Arm instruction set in the M4, ARMv9.2-A. The best accounts of that challenging work are given on the Friedrich Schiller University’s site Hello SME, and Taras Zakharko’s GitHub.

From early work by Dougall Johnson on the M1, it has been known that some of the functions in Apple’s vast Accelerate maths libraries can run code on the AMX. Thanks to the guidance of Maynard Handley, a year ago I concluded that one of those is the vDSP_mmul function in the vDSP sub-library. This article reports tests of that function in a Mac mini M4 Pro running Sequoia 15.1.1, leads on to an explanation of previous results using floating point and NEON tests, and considers the effects of Power Modes.

Methods

I used the same methods as described previously. These consist of running many tight loops of test code designed to be confined as much as possible to register access. The test used here uses vDSP_mmul, multiplying two 16 x 16 32-bit floating point matrices, and consists of 15 x 10^6 such loops. Its source code is given in the Appendix at the end. During each test run, the command tool powermetrics gathers core power and performance data in sampling periods of 0.1 second for a total of 50 samples. Its output is piped into a text file, which is then analysed using Numbers and DataGraph.

For comparison, I also show here my results from equivalent tests using assembly code for the NEON vector processor, as given in that previous article.

Power used by thread

The first graphs show average power use during each test with increasing numbers of threads, where error bars indicate a spread of +1 standard deviation.

m4powervdsp1

The linear relationship for 1-10 threads, run on 1-10 P cores, is a better fit for vDSP_mmul, shown above, than NEON below. Run on P cores, each vDSP_mmul thread used about 3.6 W, significantly greater than NEON at 3.0 W. However, when those high QoS threads spilt over onto E cores, that relationship for vDSP_mmul broke down, leaving the highest power use as that measured for 10 threads, one for each P core available in the M4 Pro chip used.

There’s no evidence of the steps seen below in NEON at 2-3 and 7-8 threads.

m4powerneon1

Execution time

m4powervdsp2

Total execution time has a strongly linear relationship too, for vDSP_mmul at about 2.6 seconds per thread. Again, that relationship broke down once test threads had spilt over onto E cores, unlike in the graph below for NEON.

m4powerneon2

m4powervdsp3

There are also obvious differences in the time required to execute each thread. vDSP_mmul (above) showed a linear increase from 1-5 threads, followed by a constant time for 5-10 threads. Once threads were running on E cores, the relationship became far looser, as seen in the red regression line. In NEON below, the relationship was weakest in the 1-5 thread range, and closer on the E cores from 11-14 threads.

m4powerneon3

Energy use

m4powervdsp4

When run on P cores, there’s a good line of fit indicating energy use of 9.2 J per thread (above), compared with NEON (below) at 7.7 J. The red line of best fit for the E core section from 11-14 threads suggests E core energy use of about 4.8 J per thread, again higher than NEON, which shows a much better fit and only about 3 J per thread.

m4powerneon4

Maximum total energy use estimated for vDSP_mmul was just over 140 J, while for NEON it was only about 90 J.

The overall picture of vDSP_mmul is thus different from those seen in floating point and NEON tests. When run on P cores alone, vDSP_mmul behaves more linearly, using significantly more power and energy. Once running threads on E cores, though, that breaks down and performance falls, rather than simply slowing.

The role of frequency

There has long been a tacit assumption that, when running on P cores, computationally intensive threads such as those used in these tests are run at a fairly constant frequency close to maximum. Looking back at my earlier results on M1 and M3 cores, though, frequencies aren’t so consistent, and in many cases not that close to maximum either.

powermetrics provides more frequency figures than you know what to do with, although most are derived and to some extent imaginary, making reconciliation difficult. Taking the best estimate of core frequency as that given as Cluster HW active frequency, patterns seen on M4 P cores are distinctive.

m4frequenciesByThreads

This graph shows those frequencies for the active P cluster by the number of threads, for floating point, NEON and vDSP_mmul tests. Frequencies for the first two are identical, at P core maximum for a single thread, then falling sharply from 2 to 3 threads. When more threads are run, the cluster that’s fully active is run at the same frequency as that for 5 threads (P cluster size on this M4 Pro), while the other P cluster follows the same frequencies shown in the graph for the number of threads it’s running.

Frequencies are controlled by macOS, and this suggests it adopts a standard pattern when running the two in-core tests, and a different one for vDSP_mmul presumably geared to performance of the AMX. Changes in frequency also account for the steps seen at 2-3 and 7-8 (= 5+2 and 5+3) threads in the NEON graphs above.

Power Modes

There has been considerable interest in the Power Mode setting available in macOS when running on M4 Pro and Max chips. To assess its effects, I ran tests in 10 threads, to fully occupy the P cores, in each of the three Power modes.

There was no difference between results for the default Automatic and High Power modes, as expected. This is because High Power mode doesn’t change frequencies or power use in the short term; instead, more aggressive fan use enables higher frequencies to be sustained for prolonged periods, where in Automatic mode they would be throttled.

Low Power does have substantial effects on core frequency, performance and power use, though. When running floating point tests in 10 threads, cluster frequency was reduced from 3,852 to 3,624 MHz, 94% of that in Automatic and High Power modes. That reduced power use from a mean of 13.9 W to 11.2 W, and increased the time to complete threads. Time taken by floating point threads increased to 106% of that in Automatic and High Power modes, while that for NEON increased to 135%, and vDSP_mmul to 177%. While the reduced performance of floating point threads is unlikely to be noticeable, for vector and matrix threads it’s likely to be obvious to the user.

Key information

  • vDSP_mmul matrix multiplication from the vDSP sub-library in Accelerate behaves consistently with it being performed in the AMX co-processors in M4 Pro chips.
  • vDSP_mmul threads used significantly more power than NEON, reaching a maximum of just over 36 W when fully occupying all 10 P cores.
  • When spilt over to E cores, vDSP_mmul threads were much slower and their performance erratic, consistent with the E cluster having a smaller and significantly less performant AMX.
  • In-core tests (floating point and NEON) show common frequency regulation according to the number of cores active in each P core cluster. This runs a single thread at maximum frequency, then reduces sharply from 2 to 3 threads/cores. This accounts for the deviations from linearity observed in power use and performance. That pattern doesn’t appear in vDSP_mmul threads, though.
  • High Power and Automatic modes are identical in short-term tests.
  • Low Power mode reduces P cluster frequency and power use. Although its effects are unlikely to be noticeable in floating point threads, effects on vector and matrix threads are greater, and performance reductions are likely to be obvious to the user.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Finding and evaluating AMX co-processors in Apple silicon chips (M1 and M3)

Appendix: Source code

16 x 16 32-bit floating point matrix multiplication

// Runs theReps tight loops of a 16 x 16 single-precision matrix multiplication.
var theCount: Float = 0.0
let A = [Float](repeating: 1.234, count: 256)   // 16 x 16 matrix A
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)   // 16 x 16 matrix B
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)     // 16 x 16 result matrix C
let IC: vDSP_Stride = 1
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
    B.withUnsafeBufferPointer { Bptr in
        C.withUnsafeMutableBufferPointer { Cptr in
            for _ in 1...theReps {
                vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB,
                          Cptr.baseAddress!, IC, M, N, P)
                theCount += 1
            } } } }
return theCount

Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision.” “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”
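For a concrete sense of the argument order (A, B, C each followed by its stride, then the three dimensions), here is a minimal 2 x 2 worked example, with M = N = P = 2 and unit strides:

```swift
import Accelerate

// C = A x B for two 2 x 2 single-precision matrices, stored row-major.
let A2: [Float] = [1, 2,
                   3, 4]
let B2: [Float] = [5, 6,
                   7, 8]
var C2 = [Float](repeating: 0, count: 4)
vDSP_mmul(A2, 1, B2, 1, &C2, 1, 2, 2, 2)
// C2 is now [19, 22, 43, 50]
```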

Inside M4 chips: CPU power, energy and mystery

Few comparisons or benchmarks for M-series chips take into account the reason for equipping Apple silicon chips with more than one CPU core type, according to Arm’s big.LITTLE architecture. Measuring single- or multi-core performance ignores the purpose of E cores, and estimating overall power use can’t compare those core types. This article tries to estimate the cost in terms of power and energy of running identical tests on M4 P and E cores, and thereby provide insight into some of the most distinctive features of Apple silicon, and their benefits.

Methods

To run these two in-core performance tests I use a GUI app wrapped around a series of loading tests designed to enable the CPU core to execute that code as fast as possible, and with as few extraneous influences as possible. Both tests used here are written in assembly code, and aren’t intended to be purposeful in any way, nor to represent anything that real-world code might run. Those are:

  • 64-bit floating point arithmetic, including an FMADD instruction to multiply and add, and FSUB, FDIV and FADD for subtraction, division and addition;
  • 32-bit 4-lane dot-product vector arithmetic (NEON), including FMUL, two FADDP and a FADD instruction;

Source code of the loops is given in the Appendix.

The GUI app sets the number of loops to be performed, and the number of threads to be run. Each set of loops is then put into the same Grand Central Dispatch queue for execution, at the maximum Quality of Service of 33. That ensures they are run preferentially on P cores, but spill over to E cores when no P core is available, which occurs when more than 10 threads are run concurrently. Timing of thread execution is performed using Mach Absolute Time, and the time for each thread to be executed is displayed at the end of the tests.
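That dispatch pattern can be sketched as follows; runTightLoops is an illustrative stand-in for the assembly test code, and the names here are mine, not the app's:

```swift
import Dispatch

// Stand-in for one test thread's tight loops of assembly code.
func runTightLoops(_ loops: Int) {
    var x = 1.0
    for _ in 0..<loops { x = x * 1.000001 / 1.000001 }
}

// One concurrent queue at the highest QoS (.userInteractive, raw value 33),
// so threads run preferentially on P cores, spilling to E cores only when
// all P cores are busy.
let queue = DispatchQueue(label: "core-tests", qos: .userInteractive,
                          attributes: .concurrent)
let counter = DispatchQueue(label: "counter")   // serialises the count
let group = DispatchGroup()
var completed = 0

for _ in 0..<10 {                               // number of test threads
    queue.async(group: group) {
        runTightLoops(1_000_000)
        counter.sync { completed += 1 }
    }
}
group.wait()                                    // all threads have finished here
```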

For these tests, the total number of loops to be executed in each thread was set at 5 x 10^8 for floating point, and 3.5 x 10^9 for NEON. Those values were chosen to take 2-3 seconds per thread, to ensure the whole test period was available for analysis.

Immediately before running each test, I launch powermetrics from the command line, to gather core power and performance data in sampling periods of 0.1 second for a total of 50 samples. Its output is piped into a text file, which is then analysed using Numbers and DataGraph. All tests were conducted on a Mac mini M4 Pro with 10 P and 4 E cores, running macOS 15.1.1 in standard power mode.

Each test was inspected individually, and seen to contain the following phases:

  1. small initial activity resulting from bringing the GUI app into focus, and clicking the Run button;
  2. a brief period of low activity, typically with total CPU power at below 50 mW;
  3. 1-2 sample periods when threads are loaded onto the cores;
  4. 15-21 sample periods when threads are run, whose total CPU power measurements are collected for analysis;
  5. 1-2 sample periods when threads are unloaded;
  6. a return to low activity, typically with total CPU power returning below 50 mW.

Means and standard deviations were then calculated for each series of power measurements, and pooled with times taken to execute threads.

Power used by thread

The first pair of graphs shows average power use for the number of threads run, shown here with error bars giving the range of +1 standard deviation. These show two sections: for 1-10 threads, when all were running on P cores, and for 11-14 threads, when the 10 P cores were fully committed and 1-4 threads spilt over to run on E cores at their maximum frequency. Maximum power used during testing was just short of 34 W.

m4powerflopt1

Those for the floating point test above, and NEON below, have regression lines fitted, indicating that:

  • Each additional floating point thread required 1,300 mW on P cores, and 110 mW on E cores.
  • Each additional NEON thread required 3,000 mW on P cores, and 280 mW on E cores.
  • P cores thus required 11-12 times the power of E cores, or E cores used 8-9% of the power of P cores.

m4powerneon1
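Those fractions follow directly from the fitted per-thread gradients:

```swift
// Per-thread power from the fitted regressions above, in mW.
let fpPCore = 1300.0, fpECore = 110.0       // floating point test
let neonPCore = 3000.0, neonECore = 280.0   // NEON test

let fpFraction = fpECore / fpPCore          // ≈ 0.085, about 8.5%
let neonFraction = neonECore / neonPCore    // ≈ 0.093, about 9.3%
```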

Although linear regressions aren’t a bad fit, there’s consistent deviation from the linear relationship seen in previous analyses on M1 and M3 cores. More remarkably, the pattern of deviation is identical between these two tests, although they run in different units in these cores. In both cases, power use was high for 2 and 7 threads, while that for 3 and 8 threads was slightly lower. The only unusual pattern seen in powermetrics output was that, when running 2 and 7 threads, thread mobility was much higher than in other tests.

Previous tests on M1 and M3 P cores found that each additional floating point thread run on those requires about 935 mW, indicating a substantial increase in power used by M4 P cores when running at their higher maximum frequency. E cores in an M1 Pro require about 100 mW each when running at maximum frequency, similar to those in the M4.

Execution time

As power is the rate of energy use over time, the next step is to examine total execution time for all the threads running concurrently, which should form a linear relationship with different gradients for P and E cores. The next two graphs demonstrate that.

m4powerflopt2

For both floating point (above) and NEON (below), there’s a tight linear relationship between total execution time and numbers of threads. Floating point demonstrates that each thread costs 2.4 seconds on P cores and 3.6 seconds on E cores, making E core execution time 150% that of P cores. NEON is similar, at 2.5 seconds on P cores and 3.4 seconds on E cores, for a ratio of 136%.

m4powerneon2
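The per-thread gradients give those ratios directly:

```swift
// Seconds of total execution time added per thread, from the fits above.
let fpPerThread = (p: 2.4, e: 3.6)          // floating point
let neonPerThread = (p: 2.5, e: 3.4)        // NEON

let fpRatio = fpPerThread.e / fpPerThread.p         // 1.5, i.e. 150%
let neonRatio = neonPerThread.e / neonPerThread.p   // 1.36, i.e. 136%
```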

Time taken for the slowest thread to complete execution shows interesting finer detail.

m4powerflopt3

For both tests, performance falls into several sections according to the number of threads run. With fewer than 5 threads, there’s a sharp rise in time taken per thread. From 5-10 threads, the time required remains constant, before increasing from 10-14 threads, as additional threads spill over onto E cores.

This has implications for anyone trying to measure core performance, as it demonstrates that a single thread can run disproportionately fast, compared with 3-10 threads. Basing any conclusion or comparison on a single thread completing in little more than 2 seconds, when 5 concurrent threads would take 2.34 seconds, 117% of the single thread, could be misleading.

m4powerneon3

Energy use

Although power use determines heat production, so is an important factor in determining cooling requirements, total energy required to execute threads is equally important for Macs running from battery. Simply reducing core frequency will reduce power used, but by extending the time taken to complete tasks, it may have no effect on energy used, and battery endurance. My final two graphs therefore show estimated total energy used when running test threads on P and E cores, the ultimate test of any big.LITTLE CPU design such as that in the M4.

m4powerflopt4

Graphs for floating point (above) and NEON (below) are inevitably similar in form to those for power, with a near-linear section from 1-10 threads, when the threads are run only on P cores, and from 11-14 threads, when they also spill over to E cores.

Fitted regression lines provide the energy cost for each additional thread:

  • For floating point, each thread run on a P core costs 3.1 J, and for an E core 1.5 J, making the energy used by an E core 47% that of a P core.
  • For NEON, P cores cost 7.7 J per thread, and E cores 3.0 J, making the energy used by an E core 38% that of a P core.

It’s important to remember that the E cores here aren’t being run at frequencies for high efficiency, but at their maximum so they can substitute for the P cores that are already in use.

m4powerneon4

Considering the small deviations from those linear relationships, it appears that running 2, 6 or 7 threads on P cores requires slightly more energy than predicted from the regression lines shown.

Unfortunately, assessing the energy used by E cores running at low frequencies, as they normally do when performing background tasks, is fraught with inaccuracies due to their low power use. My previous estimate for floating point tests is that a slow-running E core uses less than 45 mW per thread, and for the same task requires about 7% of the energy used by a P core running at maximum frequency, but I have lower confidence in the accuracy of those figures than in those above for higher frequencies.

Key information

  • When running the same code at maximum frequency, E cores used 8-9% of the power of P cores.
  • Power use when running 2 or 7 threads was anomalously high, possibly due to high thread mobility.
  • Execution on E cores was significantly slower than on P cores, at 136-150% of the time required on P cores.
  • Single-core performance measurements may not be accurate reflections of performance on multiple cores.
  • When running the same code at maximum frequency, energy used by an E core is expected to be 38-47% that of a P core.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance

Appendix: Source code

_fpfmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
FMOV D4, D0
FMOV D5, D1
FMOV D6, D2
LDR D7, INC_DOUBLE
fp_while_loop:
SUBS X4, X4, #1
B.EQ fp_while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
B fp_while_loop
fp_while_done:
FMOV D0, D4
LDR LR, [SP], #16
RET

_neondotprod:
STR LR, [SP, #-16]!
LDP Q2, Q3, [X0]
FADD V4.4S, V2.4S, V2.4S
MOV X4, X1
ADD X4, X4, #1
dp_while_loop:
SUBS X4, X4, #1
B.EQ dp_while_done
FMUL V1.4S, V2.4S, V3.4S
FADDP V0.4S, V1.4S, V1.4S
FADDP V0.4S, V0.4S, V0.4S
FADD V2.4S, V2.4S, V4.4S
B dp_while_loop
dp_while_done:
FMOV S0, S2
LDR LR, [SP], #16
RET

Inside M4 chips: CPU core performance

There’s no doubt that the CPUs in M4 chips outperform their predecessors. General-purpose benchmarks such as Geekbench demonstrate impressive rises in both single- and multi-core results, in my experience from 3,191 (M3 Pro) to 3,892 (M4 Pro), and 15,607 (M3 Pro) to 22,706 (M4 Pro). But the latter owes much to the increase in Performance (P) core count from 6 to 10. In this series I concentrate on much narrower concepts of performance in CPU cores, to provide deeper insight into topics such as core types and energy efficiency. This article examines the in-core performance of P and E cores, and how they differ.

P core frequencies have increased substantially since the M1. If we set that as 100%, M3 P cores run at around 112-126% of that frequency, and those in the M4 at 140%.

E cores are more complex, as they have at least two commonly used frequencies: one when running low Quality of Service (QoS) threads, and another when running high QoS threads that have spilt over from P cores. Low QoS threads are run at 77% of M1 frequency on an M3, and 105% on an M4. High QoS threads are normally run at higher frequencies, 133% on M3 E cores (relative to the M1 at 100%), but only 126% on the M4.

Methods

To measure in-core performance I use a GUI app wrapped around a series of loading tests designed to enable the CPU core to execute that code as fast as possible, and with as few extraneous influences as possible. Of the seven tests reported here, three are written in assembly code, and the others call optimised functions in Apple’s Accelerate library from a minimal Swift wrapper. These tests aren’t intended to be purposeful in any way, nor to represent anything that real-world code might run, but simply provide the core with the opportunity to demonstrate how fast it can be run at a given frequency.

The seven tests used here are:

  • 64-bit integer arithmetic, including a MADD instruction to multiply and add, a SUBS to subtract, an SDIV to divide, and an ADD;
  • 64-bit floating point arithmetic, including an FMADD instruction to multiply and add, and FSUB, FDIV and FADD for subtraction, division and addition;
  • 32-bit 4-lane dot-product vector arithmetic (NEON), including FMUL, two FADDP and a FADD instruction;
  • simd_float4 calculation of the dot-product using simd_dot in the Accelerate library;
  • vDSP_mmul, a function from the vDSP sub-library in Accelerate, which multiplies two 16 x 16 32-bit floating point matrices, and in M1 and M3 chips appears to use the AMX co-processor;
  • SparseMultiply, a function from Accelerate’s Sparse Solvers, which multiplies a sparse and a dense matrix, and may use the AMX co-processor in M1 and M3 chips;
  • BNNSMatMul, matrix multiplication of 32-bit floating-point numbers, also in the Accelerate library, and since deprecated.

Source code of the loops is given in the Appendix.

The GUI app sets the number of loops to be performed, and the number of threads to be run. Each set of loops is then put into the same Grand Central Dispatch queue for execution, at a set Quality of Service (QoS). Timing of thread execution is performed using Mach Absolute Time, and the time for each thread to be executed is displayed at the end of the tests.

I normally run tests at either the minimum QoS of 9, or the maximum of 33. The former are constrained by macOS to be run only on E cores, while the latter are run preferentially on P cores, but may spill over to E cores when no P core is available. All tests are run with a minimum of other activities on that Mac, although it’s not unusual to see small amounts of background activity on the E cores during test runs.

The number of loops completed per second is calculated for two thread totals for each of the three execution contexts. Those are:

  • P cores alone, based on threads run at high QoS on 1 and 10 P cores;
  • E cores at high frequency (‘fast’), based on threads run at high QoS, with totals of 10 (no threads on E cores) and 14 (4 threads on E cores);
  • E cores at low frequency (‘slow’), run at low QoS on 1 and 4 E cores.

Results are then corrected by removing overhead estimated as the rate of running empty loops. Finally, each test is expressed as a percentage of the performance achieved by the P cores in an M1 chip. Thus, a loop rate double that achieved by running the same test on an M1 P core is given as 200%.

P core performance

As in subsequent sections, these are shown in the bar chart below, in which the pale blue bars are for M1 P cores, dark blue bars for M3 P cores, and red for M4 P cores.

m134coreperf1

As I indicated in my preview of in-core performance, there is little difference in integer performance between M3 and M4 P cores, but a significant increase in floating point, which matches that expected from the increased frequency of the M4.

Vector performance in the NEON and simd dot tests, and matrix multiplication in vDSP mmul rise higher than would be expected by frequency differences alone, to over 160% of M1 performance. The latter two tests are executed using Accelerate library calls, so there’s no guarantee that they are executed the same on different chips, but the first of those is assembly code using NEON instructions. SparseMultiply and BNNS matmul are also Accelerate functions whose execution may differ, and don’t fare quite as well on the M4.

E core slow performance

m134coreperf2

On E cores, threads run at low QoS are universally run at frequencies close to idle, as reflected in their performance, here still expressed relative to an M1 P core at much higher frequency. Frequency differences account for the relatively poor performance of M3 E cores, and for the improvement in results for the M4. Those improvements are disproportionate in vector and matrix tests, which could be accounted for by the M4 E core running those at higher frequencies than would be normal for low QoS threads.

Although the best of these, NEON, is still well below M1 P core performance (73%), this suggests a design decision to deliver faster vector processing on M4 E cores, which is interesting.

E core fast performance

m134coreperf3

Ideally, when high QoS threads overspill from P cores, it’s preferable that they’re executed as fast as they would have been on a P core. Those in the M1 fall far short of that, in scalar and vector tests only delivering 40-60% of a P core, although that seemed impressive at the time. The M3 does considerably better, with vector and one matrix test slightly exceeding the M1 P core, and the M4 is even faster in vector calculations, peaking at over 130% for NEON assembly code.

Far from being a cut-down version of its P core, the M4 E core can now deliver impressive vector performance when run up to maximum frequency.

M4 P and E comparison

Having considered how P and E cores have improved against those in the M1, it’s important to look at the range of computing capacity they provide in the M4. This is shown in the chart below, where pale blue bars are P cores, red bars E cores at high QoS and frequency, and dark blue bars E cores at low QoS and frequency. Again, these are all shown relative to P core performance in the M1.

m134coreperf4

Apart from the integer test, scalar floating point, vector and matrix calculations on P cores range between 140-175% of those of the M1, a significantly greater increase than expected from frequency alone. Scalar and vector (but not matrix) calculations on E cores at high frequency are slower, although in most situations that shouldn’t be too noticeable. Performance does drop off for E cores at low frequency, though, and would clearly have an impact on code run there.

Given the range of operating frequencies, P and E cores in the M4 chip deliver a wide range of performance at different power levels, and it’s power that I’ll examine in the next article in this series.

Key information

  • M4 P core maximum frequency is 140% that of the M1. That increase in frequency accounts for much of the improved P core performance seen in M4 chips.
  • E core frequency changes are more complex, and some have reduced rather than risen compared with the M1.
  • P core floating point performance in the M4 has increased as would be expected from its frequency change, and vector and matrix performance has increased more, to over 160% of that of the M1.
  • E core performance at low QoS and frequency has improved in comparison to the M3, and most markedly in vector and matrix tests, suggesting design improvements in the latter.
  • E core performance at high QoS and frequency has also improved, again most prominently in vector tests.
  • Across their frequency ranges, M4 P and E cores now deliver a wide range of performance and power use.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores

Appendix: Source code

_intmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
int_while_loop:
SUBS X4, X4, #1
B.EQ int_while_done
MADD X0, X1, X2, X3
SUBS X0, X0, X3
SDIV X1, X0, X2
ADD X1, X1, #1
B int_while_loop
int_while_done:
MOV X0, X1
LDR LR, [SP], #16
RET

_fpfmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
FMOV D4, D0
FMOV D5, D1
FMOV D6, D2
LDR D7, INC_DOUBLE
fp_while_loop:
SUBS X4, X4, #1
B.EQ fp_while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
B fp_while_loop
fp_while_done:
FMOV D0, D4
LDR LR, [SP], #16
RET

_neondotprod:
STR LR, [SP, #-16]!
LDP Q2, Q3, [X0]
FADD V4.4S, V2.4S, V2.4S
MOV X4, X1
ADD X4, X4, #1
dp_while_loop:
SUBS X4, X4, #1
B.EQ dp_while_done
FMUL V1.4S, V2.4S, V3.4S
FADDP V0.4S, V1.4S, V1.4S
FADDP V0.4S, V0.4S, V0.4S
FADD V2.4S, V2.4S, V4.4S
B dp_while_loop
dp_while_done:
FMOV S0, S2
LDR LR, [SP], #16
RET

func runAccTest(theA: Float, theB: Float, theReps: Int) -> Float {
var tempA: Float = theA
var vA = simd_float4(theA, theA, theA, theA)
let vB = simd_float4(theB, theB, theB, theB)
let vC = vA + vA
for _ in 1...theReps {
tempA += simd_dot(vA, vB)
vA = vA + vC
}
return tempA
}

16 x 16 32-bit floating point matrix multiplication

var theCount: Float = 0.0
let A = [Float](repeating: 1.234, count: 256)
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)
let IC: vDSP_Stride = 1
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
B.withUnsafeBufferPointer { Bptr in
C.withUnsafeMutableBufferPointer { Cptr in
for _ in 1...theReps {
vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB, Cptr.baseAddress!, IC, M, N, P)
theCount += 1
} } } }
return theCount

Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision”: “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”

Sparse matrix multiplication

var theCount: Float = 0.0
let rowCount = Int32(4)
let columnCount = Int32(4)
let blockCount = 4
let blockSize = UInt8(1)
let rowIndices: [Int32] = [0, 3, 0, 3]
let columnIndices: [Int32] = [0, 0, 3, 3]
let data: [Float] = [1.0, 4.0, 13.0, 16.0]
let A = SparseConvertFromCoordinate(rowCount, columnCount, blockCount, blockSize, SparseAttributes_t(), rowIndices, columnIndices, data)
defer { SparseCleanup(A) }
var xValues: [Float] = [10.0, -1.0, -1.0, 10.0, 100.0, -1.0, -1.0, 100.0]
let yValues = [Float](unsafeUninitializedCapacity: xValues.count) {
resultBuffer, count in
xValues.withUnsafeMutableBufferPointer { denseMatrixPtr in
let X = DenseMatrix_Float(rowCount: 4, columnCount: 2, columnStride: 4, attributes: SparseAttributes_t(), data: denseMatrixPtr.baseAddress!)
let Y = DenseMatrix_Float(rowCount: 4, columnCount: 2, columnStride: 4, attributes: SparseAttributes_t(), data: resultBuffer.baseAddress!)
for _ in 1...theReps {
SparseMultiply(A, X, Y)
theCount += 1
} }
count = xValues.count
}
return theCount

Apple describes SparseMultiply() as performing “the multiply operation Y = AX on a sparse matrix of single-precision, floating-point values.” “Use this function to multiply a sparse matrix by a dense matrix.”

BNNS matrix multiplication

var theCount: Float = 0.0
let inputAValues: [Float] = [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 ]
let inputBValues: [Float] = [
1, 2,
3, 4,
1, 2,
3, 4,
1, 2,
3, 4]
var inputADescriptor = BNNSNDArrayDescriptor.allocate(
initializingFrom: inputAValues,
shape: .imageCHW(3, 4, 2))
var inputBDescriptor = BNNSNDArrayDescriptor.allocate(
initializingFrom: inputBValues,
shape: .tensor3DFirstMajor(3, 2, 2))
var outputDescriptor = BNNSNDArrayDescriptor.allocateUninitialized(
scalarType: Float.self,
shape: .imageCHW(inputADescriptor.shape.size.0,
inputADescriptor.shape.size.1,
inputBDescriptor.shape.size.1))
for _ in 1...theReps {
BNNSMatMul(false, false, 1,
&inputADescriptor, &inputBDescriptor, &outputDescriptor, nil, nil)
theCount += 1
}
inputADescriptor.deallocate()
inputBDescriptor.deallocate()
outputDescriptor.deallocate()
return theCount

Inside M4 chips: E and P cores

In the two previous articles (links at the end), I explored some of the features and properties of Performance (P) cores in Apple’s latest M4 chips. This article looks at their Efficiency (E) cores by comparison.

M4 family

In the three current M4 designs, there are only two variations in terms of E cores:

  • Base M4, with 6 E cores, except for a cheaper variant with only 4 active E cores.
  • M4 Pro and Max, with 4 E cores, including ‘binned’ variants.

Apple is expected to release an Ultra variant in 2025, with two M4 Max chips in tandem, providing a total of 8 E cores. Apart from the number of cores, all E cores are the same, and different from P cores.

E core architecture

All E cores are arranged in a single cluster of 4 or 6, sharing a common L2 cache and running at the same frequency (clock speed). Analysis of M1 cores implies that each E core has roughly half the number of processing units found in the P core, where the P core has more than one such unit, giving an M1 E core roughly half the compute capacity of a P core. I haven’t seen any comparable analysis of cores in later M families, although differences in power consumption imply there remain substantial differences in processing units and compute capacity.

Frequency

Like P cores, E cores can be set to run at any of 5 values between the minimum of 1,020 MHz and maximum of 2,592 MHz (1.0-2.6 GHz). When running macOS, cluster frequency is set by macOS at a kernel level; other operating systems may offer more direct control. This range of frequencies is significantly narrower than that of E cores in the M3, which range between 744-2,748 MHz.

E cores idle at 1,020 MHz, and although they can be shut down altogether, that’s exceptional given the steady demand for macOS background threads to be run on them. Nevertheless, powermetrics still reports their ‘down’ residencies separately from idle residencies.

Instruction set

This is believed to be identical to that of the M4 P cores, ARMv9.2-A without its Scalable Vector Extension (SVE), enabling the same threads to be run on either core type.

Single thread comparisons

One way to appreciate the contrasts between core types is to compare a single intensive in-core thread run in each. For this purpose, I used a tight loop of floating point calculations, running at two different Quality of Service (QoS) settings, in macOS 15.1.

Single thread at high QoS on P cores

m4singlePflopt1

This thread was initially loaded onto P13 (red) in the second (P1) cluster, and after 3.7 seconds was moved to P5 (blue) in the first (P0) cluster. After a further 4.6 seconds running on that, it was moved back to the second (P1) cluster, to run on P11 (purple). During this run, there was almost no other activity on the two P clusters, and the inactive cluster was therefore shut down while this thread was running on the other.

m4singlePflopt2

The active cluster was run at the maximum frequency of 4,511 MHz throughout. Just before the thread was moved to a different cluster, the destination cluster was brought up and run at maximum frequency, ready to receive the thread.

m4singlePflopt3

Total CPU power remained similar throughout the period the thread was being executed, but there is a small and consistent difference according to which cluster was active: the first (P0) brought power use to about 2,520 mW, 50 mW higher than the second (P1) at about 2,470 mW. This matches the difference reported previously, and merits assessment in other M4 Pro chips to determine whether it’s a general feature.

Single thread at high QoS on E cores

There are two ways of running code, such as the in-core floating point loop test used here, on E cores: threads can be run at low QoS (Background), so that macOS allocates them to E cores only, or high QoS threads can spill over to E cores when there are more threads than available P cores. On an M4 Pro chip, that requires 11 threads, which results in one of them being allocated to the E cluster, as described next.

m4singlePonEflop1

This chart shows active residency on the four E cores with a single high QoS thread spilt onto them. While cores E1, E2 and E3 appear to handle other threads over this period of more than six seconds, core E0 appears to run at 90-100% active residency executing the spilt thread, which wasn’t moved between cores throughout.

E cluster frequency remained constant throughout at its maximum of 2,592 MHz. CPU power use was inevitably dominated by the ten P cores running at 100% active residency and maximum frequency, remaining at just under 14,000 mW. Unfortunately, using powermetrics it’s not possible to estimate the power use of the E cluster directly.

Single thread at low QoS on E cores

This is very different from the spilt thread at high QoS.

m4singleEflop1

There’s no evidence here that any single core in the E cluster ran the thread at 100% active residency. Instead, it appears to have been moved rapidly and freely around the cores, with many 0.1 second sampling intervals spanning its execution on more than one core.

m4singleEflop2

Cluster frequency remained steady at 1,050-1,060 MHz, close to the minimum, with superimposed spikes when it rose briefly to the maximum of 2,592 MHz. This suggests that the single thread would most probably have been run at close to core minimum frequency, had there not been additional threads to run.

m4singleEflop3

A similar picture is seen in power use, with spikes from a low background of about 40-45 mW required by the single thread alone.

Single thread behaviours

These can be summarised as:

  • P core (high QoS) runs at 100% active residency on a single P core at maximum frequency, and is switched between clusters irregularly (about every 3.7-4.6 seconds). Total power use is about 2,500 mW.
  • High QoS spilt over to E cores runs at 90-100% active residency on a single E core at maximum frequency, and is either not switched between cores at all, or only infrequently.
  • E core (low QoS) runs at about 100% and is moved frequently between all E cores in the cluster, at close to minimum frequency. Total power use is about 40-45 mW.

Performance, power and efficiency

Although I’ll be returning to more detailed comparisons of performance and power use between P and E cores, I provide a single illustration here, for the in-core floating point task used above.

Running 2 x 10^9 loops in each thread, P cores at maximum frequency take 9.2-9.7 seconds per thread, and use about 2,500 mW per thread. E cores running low QoS threads at close to minimum frequency take about four times as long, 38.5 seconds, but use less than 45 mW power per thread. Total energy used to complete one thread is therefore over 23 J when run on P cores, and less than 1.7 J when run on E cores. E cores therefore use only 7% of the energy that P cores do performing the same task.

Key information

  • Current M4 chips feature 4-6 CPU E cores.
  • M4 E cores are arranged in a single cluster of 4 or 6, sharing L2 cache and running at a common frequency.
  • The E core cluster can be shut down (exceptionally), idle at its minimum frequency of 1,020 MHz, or run at one of 6 set frequencies up to a maximum of 2,592 MHz, as controlled by macOS.
  • Their instruction set is the same as M4 P cores, ARMv9.2-A without its Scalable Vector Extension (SVE).
  • They use 40-45 mW when at low frequencies, but it’s not currently feasible to measure directly their maximum power use at high frequencies.
  • macOS allocates threads to E cores when their QoS is 9 (Background), and when a thread with higher QoS can’t be allocated to a P core because they are all busy. Management of frequencies and core allocation differ between those two cases.
  • High QoS threads on E cores are run at maximum frequency and appear not to move between cores.
  • Low QoS threads on E cores are run at close to minimum frequency and are highly mobile between cores.
  • Low QoS threads running on E cores run more slowly than higher QoS threads running on P cores, but E core power use is much lower, resulting in considerable saving in total energy use for the same computational task.

Previous article

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Inside M4 chips: P cores hosting a VM

One common but atypical situation for any M-series chip is running a macOS virtual machine. This article explores how virtual CPU cores are handled on physical cores of an M4 Pro host, and provides further insight into their management, and thread mobility across P core clusters.

Unless otherwise stated, all results here are obtained from a macOS 15.1 Sequoia VM in my free virtualiser Viable, with that VM allocated 5 virtual cores and 16 GB of memory, on a Mac mini M4 Pro running macOS Sequoia 15.1 with 48 GB of memory, 10 P and 4 E cores.

How virtual cores are allocated

All virtualised threads are treated by the host as if they are running at high Quality of Service (QoS), so are preferentially allocated to P cores, even though their original thread may be running at the lowest QoS. This has the side-effect of running virtual background processes considerably quicker than real background threads on the host.

In this case, the VM was given 5 virtual cores so they could all run in a single P cluster on the host. That doesn’t assign 5 physical cores to the VM, but runs VM threads up to a total of 500% active residency across all the P cores in the host. If the VM is assigned more virtual cores than are available in the host’s P cores, then some of its threads will spill over and be run on host E cores, but at the high frequency typical of host threads with high QoS.

Performance

There is a slight performance hit in virtualisation, but it’s surprisingly small compared to other virtualisers. Geekbench 6.3.0 benchmarks for guest and host were:

  • CPU single core VM 3,643, host 3,892
  • CPU multi-core VM 12,454, host 22,706
  • GPU Metal VM 102,282, host 110,960, with the VM as an Apple Paravirtual device.

Some tests are even closer: using my in-core floating point test, 1,000 Mloops run in the VM in 4.7 seconds, and in the host in 4.68 seconds.

Host core allocation

To assess P core allocation on the host, an in-core floating point test was run in the VM. This consisted of 5 threads with sufficient loops to fully occupy the virtual cores for about 20 seconds. In the following charts, I show results from just the first 15 seconds, as representative of the whole.

vmpcoresm4pro0

When viewed by cluster, those threads were mainly loaded first onto the first P cluster (pale blue bars), where they ran for just over 1 second before being moved to the second cluster (red bars). They were then regularly switched between the two P clusters every few seconds throughout the test. Four cycles were completed in this section of the results, with each taking 2.825 seconds, so threads were switched between clusters every 1.4 seconds, the same time as I found when running threads on the host alone, as reported previously.

For most of the 15 seconds shown here, total active residency across both P clusters was pegged at 500%, as allocated to the VM in its 5 virtual cores, with small bursts exceeding that. Thus that 500% represents those virtual cores, and the small bursts are threads from the host. Although the great majority of that 500% was run on the active P cluster, a total of about 30% active residency consisted of other threads from the VM, and ran on the less active P cluster. That probably represents the VM’s macOS background processes and overhead from its folder sharing, networking, and other Virtio device use.

vmpcoresm4pro1

When broken down to individual cores within each cluster, shown for the first cluster above and the second below, total activity differs little across the cores in the active cluster. During its period in the active cluster, each core had an active residency of 80-100%, bringing the cluster total to about 450% while most active.

vmpcoresm4pro2

In case you’re wondering whether this also occurs on older Apple silicon, or whether it’s just a feature of macOS Sequoia, here’s a similar example of a 4-core VM running 3 floating point threads on an M1 Max with macOS 15.1, seen in Activity Monitor’s CPU History window. There’s no movement of threads between clusters.

vmm1maxtest1

P core frequencies

powermetrics, used to obtain this data, provides two types of core frequency information. For each cluster it gives a hardware active frequency, then for each core it gives an individual frequency, which often differs within each cluster. Cores in the active P cluster were typically reported as running at a frequency of 4,512 MHz, although the cluster frequency was lower, at about 3,858 MHz. For simplicity, cluster frequencies are used here.

vmpcoresm4pro3

This chart shows reported frequencies for the two P clusters in the upper lines. Below them are total cluster active residencies to show which cluster was active during each period.

The active cluster had a steady frequency of just below 3,900 MHz, but when it became the less active one, its frequency varied greatly, from idle at 1,260 MHz up to almost 4,400 MHz, often for very brief periods. This is consistent with the active cluster running the intensive in-core test threads, and the other cluster handling other threads from both the VM and host.

CPU power

Several who have run VMs on notebooks report that they appear to drain the battery quickly. Using the previous results from the host, the floating point test used here would be expected to use a steady 7,000 mW.

vmpcoresm4pro4

This last chart shows the total CPU power use in mW over the same period, again with cluster active residency (here multiplied by 10), added to aid recognition of cluster cycles. This appears to average about 7,500 mW, only 500 mW more than expected when run on the host alone. That shouldn’t result in a noticeable increase in power usage in a notebook.

In the previous article, I remarked on how power used appeared to differ between the two clusters, and this is also reflected in these results. When the second cluster (P1) is active, power use is lower, at about 7,100 mW, and it’s higher, at about 7,700 mW, when the first cluster (P0) is active. This needs to be confirmed on other M4 Pro chips before it can be interpreted.

Key information

  • macOS guests perform almost as well as the M4 Pro host, although multi-core benchmarks are proportionate to the number of virtual cores allocated to them. In particular, Metal GPU performance is excellent.
  • All threads in a VM are run as if at high QoS, thus preferentially on host P cores. This accelerates low QoS background threads running in the VM.
  • Virtual core allocation includes all overhead from the VM, such as its macOS background threads.
  • Guest threads are as mobile as those of the host, and are moved between P clusters every 1.4 seconds.
  • Although threads run in a VM incur a small penalty in additional power use, this shouldn’t be significant for most purposes.
  • Once again, evidence suggests that the first P cluster (P0) in an M4 Pro uses slightly more power than the second (P1). This needs to be confirmed in other systems.
  • powermetrics can’t be used in a VM, unsurprisingly.

Previous article

Inside M4 chips: P cores

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Inside M4 chips: P cores

This is the first in a series diving deeper into Apple’s new M4 family of chips. This starts with details of its Performance (P) cores. Comparisons of their performance against cores in earlier M-series chips will follow separately when I have completed them.

M4 family

There are currently three M4 designs:

  • Base M4, with 4 P and 6 E cores, also available in a cheaper variant with only 4 active E cores, and a ‘binned’ variant for iPads with only 3 active P cores.
  • M4 Pro, with 10 P and 4 E cores, also available in a ‘binned’ variant with only 8 active P cores.
  • M4 Max, with 12 P and 4 E cores, also available in a ‘binned’ variant with only 10 active P cores.

Apple is expected to release an Ultra variant in 2025, consisting of two M4 Max chips connected and working in tandem, providing a total of 24 P and 8 E cores.

Apart from the number of cores in each design, their caches and memory, all P cores are the same, and different from E cores.

P core architecture

All CPU cores are arranged in clusters of up to 6. All cores within any given cluster share L2 cache, and are run at the same frequency (clock speed). The Base M4 has a single cluster of 4 P cores, while the Pro and Max have two clusters of 5 and 6 cores respectively.

Frequency

A prominent feature of both P and E cores is their variable frequency (clock speed). In the case of P cores, this can be set to any of 17 values between the minimum of 1,260 MHz and maximum of 4,512 MHz (1.3-4.5 GHz). When running macOS, cluster frequencies are set by macOS at a kernel level; other operating systems may offer more direct control.

P cores idle at 1,260 MHz, but can also be shut down altogether. Previous M-series chips have been reported by the powermetrics command-line tool as sometimes being idle at a frequency of 0 MHz, but the M4 is the first to have idle and down states reported separately, for example:
CPU 4 active residency: 0.00%
CPU 4 idle residency: 0.00%
CPU 4 down residency: 100.00%

when that core and its whole cluster are shut down rather than just idling. It’s not clear whether this is merely an administrative change, or M4 cores implement this state differently from previous cores.

Instruction set

There’s confusion over the Instruction Set Architecture (ISA) supported by M4 cores. This is explained in the LLVM source, where it’s claimed that they’re “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE). Some might consider that’s closer to ARMv8.7-A, one version more recent than the M3’s ARMv8.6-A.

Although this is now fully supported in LLVM clang, it’s not clear how fully it’s supported by Xcode, for example.

Power

When shut down, a P core consumes no power, of course, and at idle with no active residency, it uses only 1-2 mW, according to measurements reported by powermetrics.

Maximum power consumption rises to approximately 1,400 mW when running intensive floating point calculations at 100% active residency, and to approximately 3,230 mW when running NEON vector instructions at 100% active residency.

macOS core allocation

Threads are normally allocated by macOS to an available P core when their designated Quality of Service (QoS) is higher than 9 (Background), for example when using Dispatch, formerly branded Grand Central Dispatch (GCD). Running threads may also be moved periodically between P cores in the same cluster, and between clusters. Previous M-series chips appear to move threads less frequently, and may leave them to run to completion after several seconds on the same core, but threads appear to be considerably more mobile when running on M4 P cores.
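That threshold can be caricatured in a toy model. The numeric values below are the raw qos_class_t constants (Background is 9); the real scheduler weighs far more than QoS, so this is purely illustrative:

```python
# Raw qos_class_t values used by macOS QoS classes
QOS = {"userInteractive": 33, "userInitiated": 25,
       "default": 21, "utility": 17, "background": 9}

def preferred_core_type(qos_value):
    """Toy model: threads above Background QoS (9) are eligible for
    P cores; Background and below are confined to E cores."""
    return "P" if qos_value > 9 else "E"

print(preferred_core_type(QOS["background"]))     # E
print(preferred_core_type(QOS["userInitiated"]))  # P
```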

[Image: VM4core4threadPcoresARes]

This bar chart shows 4 threads from 4 virtual CPUs in a VM running for 3 seconds at 100% active residency. For almost all that period, the threads remain running on the 4 physical cores of the first P cluster in this M1 Max, with the second P cluster remaining idle for much of that time.

The following charts show 4 threads of intensive in-core floating point arithmetic running on the P cores of an M4 Pro.

[Image: m4threads1clusters]

When viewed by cluster, those threads are loaded first onto the second P cluster (red bars), where they run for 0.4 seconds before being moved to the first cluster (pale blue bars). After running there for 1.3 seconds, they’re moved back to the second cluster for a further 1.3 seconds, before completing on the first cluster.

The next two bar charts show each cluster separately, illustrating thread mobility within them.

[Image: m4threads2cluster1]

When running on the first cluster (above), threads appear to be moved to a different core approximately every 0.3 second, as they do when on the second cluster (below).

[Image: m4threads3cluster2]

[Image: m4threads4frequency]

Cluster frequency matches this movement, with each cluster being run up to maximum frequency (shown here averaged across the whole cluster) to process the threads running on its cores. The black line below those for the P clusters shows the small changes in average frequency for the E cluster over this period.

[Image: m4threads5power]

This last chart shows the total CPU power use in mW over the same period. Of particular interest here is the consistent difference in power use reported by powermetrics between the two P clusters: the first (P0) used a steady 6,000 mW when running these four threads, whereas the second (P1) used slightly less, at 5,700-5,800 mW. That could be the result of measurement error in powermetrics, peculiar to this particular chip, or could reflect an underlying difference between the two clusters.

Thread mobility makes interpreting CPU History in Activity Monitor difficult, as the fastest frequency of sampling available there is every second, while powermetrics was sampling every 0.1 second when gathering the data above. As groups of threads may be moved between clusters every 1.3 seconds or so, this can give the impression that threads are being run on both clusters simultaneously. Once again, great care is needed when interpreting the data shown by Activity Monitor.

Key information

  • Current M4 chips offer 4-12 CPU P cores.
  • M4 P cores are arranged in clusters of up to 6, sharing L2 cache and running at a common frequency.
  • P core clusters can be shut down, idle at their minimum frequency of 1,260 MHz, or run at one of 17 set frequencies up to a maximum of 4,512 MHz, as controlled by macOS.
  • Their instruction set is “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE).
  • They use 1-2 mW when idle, rising to peaks of 1,400 mW (floating point) or 3,230 mW (NEON vector code).
  • macOS preferentially allocates them threads at all QoS higher than 9 (Background).
  • Threads running on M4 P cores are mobile, and may be moved to another core in the same cluster frequently, and after just over a second may be transferred to a core in the other P cluster, when available.
  • Thread mobility makes interpretation of the CPU History window in Activity Monitor very difficult.

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Last Week on My Mac: Mac mini M4 Pro first impressions, cores and more

My Mac mini M4 Pro arrived four days early, and swiftly exceeded all expectations. Everyone who has seen this new design remarks on how tiny it is, and I still keep looking at it, wondering how the smallest Mac I have ever owned is also by far the fastest. For those who collect numbers, it matched the growing collection in Geekbench’s database, with a single-core score of 3,892, a multi-core of 22,706, and Metal GPU of 110,960. Of course those new MacBook Pros with M4 Max chips are doing better with their extra two CPU cores and twice the number of GPU cores, but second best is still far superior to anything I’ve run before.

Not only is this my fastest Mac ever, but it was also the quickest and simplest to commission. It replaces my Mac Studio M1 Max, and now sits under my Studio Display using the Studio’s old keyboard and trackpad. As I’ll explain in a future article, I opted to migrate to the mini during initial setup using the Studio’s backup SSD, a sleek OWC Envoy Pro SX connected by Thunderbolt 3. Including the inevitable macOS update, the whole process took less than two hours, from the start of unboxing to the first full Time Machine backup.

With 48 GB memory and an internal SSD of 2 TB, I was keen to benchmark that once its initial backup was done. Although I have reduced confidence in these figures, there’s no doubt that this SSD is significantly quicker than the 2 TB in the Studio; how much quicker is open to debate. AmorphousDiskMark reported a write speed of 7.7 GB/s, while my own Stibium gave 10.3 GB/s across a broad range of file sizes. There was closer agreement on read speed, at around 6.8 GB/s. Both of those are comfortably above what I expect when I can finally get hold of an external Thunderbolt 5 SSD.

CPU cores

My M4 Pro is the full-performance variant, with 10 P and 4 E cores, arranged in three clusters: an E cluster containing all four E cores, and two clusters of five P cores each. With its M3 chips, Apple increased the maximum number of cores in a cluster from 4 (in M1 and M2 chips) to 6 (M3 and M4).

M4 P cores run at frequencies between 1260 and 4512 MHz (1.3-4.5 GHz), their maximum frequency being 111% that of the M3 P core, and 140% that of the M1 P core. M4 E cores have a narrower range of frequencies than those in the M3, between 1020-2592 MHz (1.0-2.6 GHz), so run at 137% of the M3’s frequency for low QoS threads, but at 94% of the M3’s maximum for high QoS threads spilt over from P cores. Those are going to make comparisons of E core performance very interesting indeed.
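Those percentages follow directly from the frequency ranges, taking the M3 Pro E cluster’s 744 and 2748 MHz as reported by powermetrics:

```python
# E core frequency ranges in MHz
m3_min, m3_max = 744, 2748    # M3 Pro E cluster (from powermetrics)
m4_min, m4_max = 1020, 2592   # M4 E cluster

print(f"low QoS:  {m4_min / m3_min:.0%} of M3 frequency")  # 137%
print(f"high QoS: {m4_max / m3_max:.0%} of M3 frequency")  # 94%
```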

Single-thread and single-core comparisons of P core throughput are inevitably in the M4’s favour, with significantly better performance across integer, floating point and vector performance, as shown in the chart below.

[Image: M4M3multiTests]

The Y axis here gives loop throughput per second for my four basic in-core performance tests, a tight assembly code integer math loop, another tight assembly code loop of floating point math, NEON vector processor assembly code, and a tight loop calling an Accelerate routine run in the NEON unit. Pale blue bars are results for the M1, purple for the M3, and red for the shiny new M4.

Although improvements in integer and floating point performance might appear small here, these results are for single threads. When scaled up across more P cores, the differences are magnified, as shown in the regressions below.

[Image: M4vM3floatTimevThreads]

This plots total times taken to execute multiple threads, each consisting of 10^9 (one thousand million, or one billion) loops of floating point assembly code, against the number of threads, which is the same as the number of cores used, as each core runs one thread. According to the fitted linear regression equations, each thread/core of 10^9 loops takes the same period: 6.04 seconds for the M3 P cores, and 4.69 for the M4. Those equate to 1.66 x 10^8 loops/second per thread for the M3, and 2.13 x 10^8 for the M4. On that basis, the M4 performs at 128% that of the M3, significantly better than expected from the 111% increase in maximum core frequency. That requires the M4 to have improved processing speed at the same frequency as the M3, a frequency-independent improvement.
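The arithmetic behind those figures is simple to check in a few lines. Note that the M3 P core maximum of 4056 MHz used here is my assumption; the text gives only the 111% ratio:

```python
loops = 1e9  # loops of floating point code per thread

# seconds per thread/core, from the fitted regression slopes
t_m3, t_m4 = 6.04, 4.69

r_m3 = loops / t_m3  # loops/second per thread, M3 P cores
r_m4 = loops / t_m4  # loops/second per thread, M4 P cores

print(f"M3: {r_m3:.2e} loops/s")               # ~1.66e+08
print(f"M4: {r_m4:.2e} loops/s")               # ~2.13e+08
print(f"throughput ratio: {r_m4 / r_m3:.1%}")  # ~128.8%
print(f"frequency ratio:  {4512 / 4056:.1%}")  # ~111.2% (assumed M3 maximum)
```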

Core allocation

One striking difference between M4 CPU cores and those of previous Apple silicon chips is their pattern of core allocation, something you may notice even in Activity Monitor, for all its faults. This is best illustrated in its CPU History window.

For these two series of tests, I waited until the Mac was idling quietly, with little happening on its CPU cores. I then ran, in rapid succession, a series of my in-core floating point tests of 10^9 loops at high QoS, starting with a single thread, then two, and so on up to six or eight threads. This results in a pattern distinct from anything you’ll see in Intel cores, as I have shown here since the early days of the M1.

[Image: m3profloptcompo]

This is the series seen on my MacBook Pro M3 Pro running macOS 15.2 beta, and is typical of M1, M2 and M3 chips. Although the cores are out of order here, these are the six cores in the chip’s single P cluster. At the left is the obvious peak from 1 thread running on core 9, then the 2 threads of the next test appear on cores 9 and 12. Three threads fully occupy cores 7, 9 and 12, with a little spilt onto core 8, and so on until all six cores are fully occupied by 6 threads.

Activity Monitor is a little too crude to see how distinct this is, for which you have to resort to shorter sampling periods of 0.1 second, and the finer detail of active residency reported by powermetrics. Those confirm that, much of the time, these heavy in-core compute loads run in a single thread on a single core, and if they are moved around, it’s relatively infrequently. E cores are different, though.

Compare that with a similar sequence, this time for 1-8 threads, on the M4 Pro. Its ten P cores are divided into two clusters, which I’ve separated here with the blue line, although within each cluster the cores aren’t in numeric order.

[Image: m4profloptcompo]

Reading again from the left, a single thread is run in cores 6 and 8, with a little in 14 in the second cluster. Two threads are run in all five cores of the second cluster, with a little at the end from those of the first cluster. Three threads are similar, with significant contributions from all ten cores, and from then on they are similar.

So, in the M4 P cores threads are no longer allocated to single cores, and it appears from the CPU History window that they may even be allocated across more than one cluster. In fact, when the detailed active residency and frequency data from powermetrics are used, while threads are often moved between two cores in the same cluster, macOS will normally avoid unnecessarily running threads on other clusters, although it will happily move all threads from one cluster to another.

The rationale behind running all threads within the same cluster is clear, as CPU core frequencies of all cores in the same cluster are the same. Constraining threads to a single cluster thus allows others to idle and use less energy. While that continues in M4 chips, it isn’t clear why threads are moved between cores so frequently, or moved together to another cluster, and whether that brings improvements in performance.

Why % CPU in Activity Monitor isn’t what you think

One of the most frequently quoted measurements in Activity Monitor is % CPU. When the fans blow, or a Mac becomes sluggish, it’s often the first figure to look at and wonder why that process is pulling over 100%. That’s the first thing you learn: here, per cent doesn’t mean out of 100, but the sum of the real percentage for each core. As this Mac has eight cores, that allows it up to 800%, until you realise that Intel processors can use Hyper-Threading to pretend that each is two cores, making a maximum of 1,600%, not bad for a single processor. Then you come completely unstuck with Apple silicon.

Apple defines % CPU as “the percentage of CPU capability that’s being used”, a phrase that doesn’t appear to have any definition either in Apple’s documentation or elsewhere. So like everyone else, you assume that 100% for a core represents its maximum processing capacity. Only you couldn’t be more incorrect: in truth, the number given as % CPU is nothing like that.

Tests

To examine this more clearly, I have an app that runs tight loops of assembly code to load CPU cores on Apple silicon chips and measure their performance. For this, I got it to run several threads, each consisting of 500 million loops of floating point calculations. By running those at different Quality of Service (QoS) settings, macOS will send them to different core types.

When I run two threads at low QoS so they’re run only on E cores of a MacBook Pro M3 Pro, each takes 10.2 seconds to complete. If I instead run eight threads at high QoS, the first six of those will be run on the P cores, but the remaining two spill over onto two of the E cores, where they complete in just 3.0 seconds each.

Yet, according to % CPU in Activity Monitor, the two threads on the E cores come to 200%, and the two from eight run in the second test also come to 200%. You can see this in the CPU History window.

[Image: m3activitymon]

Here are the 6 E cores at the top, and 6 P cores below. ① marks the first test on two E cores, and shows the total of 200% spread across all six E cores over the 10.2 seconds it took to complete. ② marks the second test, with all 6 P cores and two E cores hitting 100% each, for a total of 800%, as shown in Activity Monitor’s window.

By now I’m sure you’re guessing how running the same code on the E cores could vary in speed by a factor of 3.4 (10.2 against 3.0 seconds) despite showing the same % CPU: that measurement doesn’t take core frequency into account.

Frequency

Unlike traditional Intel CPUs, CPU cores in Apple silicon chips can be run at a wide range of frequencies, as set by macOS. To make frequency control simpler, cores are grouped into clusters that also share L2 cache, and within each cluster all cores run at the same frequency. The M3 Pro chip consists of two clusters, one of 6 E cores, the other of 6 P cores.

When macOS loads two of those E cores with low QoS threads, it sets them to run at a low frequency to make the most of their energy efficiency. When it loads two E cores with threads that were intended to be run on P cores, as they have a high QoS, it runs the E cluster up to high frequency, so they perform better.

Measuring frequency of CPU cores is straightforward using the powermetrics command tool. When those threads were being run on E cores at low QoS, frequency was held at their minimum of 744 MHz, but run at high QoS their frequency was set to maximum, at 2748 MHz, 3.7 times faster. If full load at maximum frequency is 100% of their capability, then at low QoS they were really running at 27%, not the 100% given by Activity Monitor, and that largely accounts for the difference in time that those threads took to complete their floating point calculations.
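That reasoning can be sketched by scaling active residency by the ratio of current to maximum cluster frequency, on the simplifying assumption that throughput scales roughly linearly with frequency:

```python
def effective_capability(active_residency, freq_mhz, max_freq_mhz):
    """Estimate the percentage of a core's capability in use by
    scaling active residency by current/maximum cluster frequency."""
    return active_residency * freq_mhz / max_freq_mhz

# M3 Pro E cluster: 744 MHz at low QoS, 2748 MHz at high QoS
print(f"{effective_capability(100.0, 744, 2748):.0f}%")   # 27%
print(f"{effective_capability(100.0, 2748, 2748):.0f}%")  # 100%
print(f"frequency ratio: {2748 / 744:.1f}x")              # 3.7x
```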

What is % CPU?

What Activity Monitor actually shows as % CPU or “percentage of CPU capability that’s being used” is what’s better known as active residency of each core, that’s the percentage of processor cycles that aren’t idle, but actively processing threads owned by a given process. But it doesn’t take into account the frequency or clock speed of the core at that time, nor the difference in core throughput between P and E cores.

The next time that you open Activity Monitor on an Apple silicon Mac, to look at % CPU figures, bear in mind that while those numbers aren’t completely meaningless, they don’t mean what you might think, and what’s shown as 100% could really be anywhere between 27% and 100%.

Last Week on My Mac: M4 incoming

Almost exactly a year after it released its first Macs featuring chips in the M3 family, Apple has replaced those with the first M4 models. Benchmarkers and core-counters are now busy trying to understand how these will change our Macs over the coming year or so. Before I reveal which model I have ordered, I’ll try to explain how these change the Mac landscape, concentrating primarily on CPU performance.

CPU cores

CPUs in the first two families, M1 and M2, came in two main designs, a Base variant with 4 Performance and 4 Efficiency cores, and a Pro/Max with 8 P and 2 or 4 E cores, that was doubled-up to make the Ultra something of a beast with its 16 P and 4 or 8 E cores. Last year Apple introduced three designs: the M3 Base has the same 4 P and 4 E CPU core configuration as in the M1 and M2 before it, but its Pro and Max variants are more distinct, with 6 P and 6 E in the Pro, and 10-12 P and 4 E cores in the Max. The M4 family changes this again, improving the Base and bringing the Pro and Max variants closer again.

As these are complicated by sub-variants and binned versions, I have brought the details together in a table.

[Image: mcorestable2024]

I have set the core frequencies of the M4 in italics, as I have yet to confirm them, and there’s some confusion over whether the maximum frequency of the P core is 4.3 or 4.4 GHz.

Each family of CPU cores has successively improved in-core performance, but the greatest changes are the result of increasing maximum core frequencies and core numbers. One crude but practical way to compare them is to total the maximum core frequencies in GHz for all the cores. Strictly speaking, this should take into account differences in processing units between P and E cores, but that also appears to have changed with each family, and is hard to compare. In the table, columns giving Σfn are therefore simply calculated as
(max P core frequency x P core count) + (max E core frequency x E core count)
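That calculation can be sketched for the M4 family, taking 4.5 GHz P and 2.6 GHz E maxima; the hypothetical Ultra comes out at about 128 GHz:

```python
def sum_core_freq(p_count, p_max_ghz, e_count, e_max_ghz):
    """Sum of maximum core frequencies across all cores, in GHz."""
    return p_count * p_max_ghz + e_count * e_max_ghz

# M4 variants, assuming 4.5 GHz P and 2.6 GHz E core maxima
for name, sigma in [
    ("M4 Base (6E)", sum_core_freq(4, 4.5, 6, 2.6)),
    ("M4 Pro", sum_core_freq(10, 4.5, 4, 2.6)),
    ("M4 Max", sum_core_freq(12, 4.5, 4, 2.6)),
    ("M4 Ultra (hypothetical)", sum_core_freq(24, 4.5, 8, 2.6)),
]:
    print(f"{name}: {sigma:.1f} GHz")
```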

Plotting those sum core frequencies by variant for each of the four families provides some interesting insights.

[Image: mcoresbars2024]

Here, each bar represents the sum core frequency of each full-spec variant. Those are grouped by the variant type (Base, Pro, Max, Ultra), and within those in family order (M1 purple, M2 pale blue, M3 dark blue, M4 red). Many trends are obvious, from the relatively low performance expected of the M1 family, except the Ultra, and the changes between families, for example the marked differences in the M4 Pro, and the M3 Max, against their immediate predecessors.

Sum core frequencies fall into three classes: 20-30, 35-45, and greater than 55 GHz. Three of the four chips in the M1 family are in the lowest of those, with only the M1 Ultra reaching the highest. The M4 is the first Base variant to reach the middle class, thanks in part to its additional two E cores. Two of the M4 variants (Pro and Max) have already reached the highest class, and any M4 Ultra would reach far above the top of the chart at 128 GHz.

Real-world performance will inevitably differ, and vary according to benchmark and app used for comparison. Although single-core performance has improved steadily, apps that only run in a single thread and can’t take advantage of multiple cores are likely to show little if any difference between variants in each family.

Game Mode is also of interest for those considering the two versions of the M4 Base, with 4 or 6 E cores. This is because that mode dedicates the E cores, together with the GPU, to the game being played. It’s likely that games that are more CPU-bound will perform significantly better on the six E cores of the 10-Core version of the iMac, which also comes with a 10-core GPU and four Thunderbolt 4 ports.

Memory and GPU

Memory bandwidth is also important, although for most apps we should assume that Apple’s engineers match that with likely demand from CPU, GPU, neural engine, and other parts of the chip. There will always be some threads that are more memory-bound, whose performance will depend more on memory bandwidth than on CPU or GPU cores.

Although Apple claims successive improvements in GPU performance, the number of GPU cores has ranged from 8 in Base chips up to 32-40 in Max chips. Where the Max variants come into their own is in support for multiple high-res displays, and challenging video editing and processing.

Thunderbolt and USB 3

The other big difference in these Macs is support for the new Thunderbolt 5 standard, available only in models with M4 Pro or M4 Max chips; Base variants still only support Thunderbolt 4. Although there are currently almost no Thunderbolt 5 peripherals available apart from an abundant supply of expensive cables, by the end of this year there should be at least one range of SSDs and one dock shipping.

As ever with claimed Thunderbolt performance, figures given don’t tell the whole story. Although both TB4 and USB4 claim ‘up to’ 40 Gb/s transfer rates, in practice external SSD performance is significantly different, with Thunderbolt topping out at about 3 GB/s and USB4 reaching up to 3.4 GB/s. In practice, TB5 won’t deliver the whole of its claimed maximum of 120 Gb/s to a single storage device, and current reports are that it will only achieve disk transfers at 6 GB/s, or twice TB4. However, in use that’s close to the expected performance of internal SSDs in Apple silicon Macs, and should make booting from a TB5 external SSD almost indistinguishable in terms of speed.
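The gap between headline link rates and real storage throughput is easier to see with both in the same units; this rough conversion ignores protocol overhead, and the observed figures are those quoted above:

```python
def gbps_to_GBps(gbps):
    """Rough conversion of a link rate in Gb/s to GB/s (8 bits per byte),
    ignoring protocol overhead."""
    return gbps / 8

for name, link, observed in [
    ("TB4", 40, 3.0),    # Thunderbolt 4 tops out around 3 GB/s
    ("USB4", 40, 3.4),   # USB4 SSDs reach about 3.4 GB/s
    ("TB5", 120, 6.0),   # early reports of TB5 disk transfers
]:
    print(f"{name}: {gbps_to_GBps(link):.0f} GB/s raw, {observed} GB/s in practice")
```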

As far as external ports go, this widens the gap between the M4 Pro Mac mini’s three TB5 ports, which should now deliver 3.4 GB/s over USB4 or 6 GB/s over TB5, and its two USB-C ports that are still restricted to USB 3.2 Gen 2 at 10 Gb/s, equating to 1 GB/s, the same as in M1 models from four years ago.

My choice

With a couple of T2 Macs and a MacBook Pro M3 Pro, I’ve been looking to replace my original Mac Studio M1 Max. As it looks likely that an M4 version of the Studio won’t be announced until well into next year, I’m taking the opportunity to shrink its already modest size to that of a new Mac mini. What better choice than an M4 Pro with 10 P and 4 E cores and a 20-core GPU, and the optional 10 Gb Ethernet? I seldom use the fourth Thunderbolt port on the Studio, and have already ordered a Kensington dock to deliver three TB5 ports from one on the Mac, and I’m sure it will drive my Studio Display every bit as well as the Studio has done.

If you have also been tempted by one of the new Mac minis, I was astonished to discover that three-year AppleCare+ for it costs less than £100, that’s two-thirds of the price that I pay each year for AppleCare+ on my MacBook Pro.

I look forward to diving deep into both my new Mac and Thunderbolt 5 in the coming weeks.
