Reading view

There are new articles available, click to refresh the page.

Last Week on My Mac: Compression models

15 December 2024 at 16:00

Over the fifty years since I started using computers, one of the greatest changes has been who does the waiting. In my early days, it was me, in some cases waiting overnight for batches to be processed on a mainframe computer. By the mid 1990s I had to wait a couple of days for some of my optimisation runs to complete on the Mac in my office. Since then, as Macs have got progressively faster, they have come to do the waiting for me. Whether I’m editing an image or writing an article, my Mac keeps pace with almost everything I do.

One notable if infrequent exception is when there’s a macOS update. On a newer Apple silicon model, installation itself is far quicker than when Apple changed to using the SSV in Big Sur, but there’s always that wait for the downloaded update to be “prepared” in the background. Much of that is thought to be decompression, which minimises the size of downloads by putting the burden on the Mac to expand and organise what needs to be installed.

Data compression is widely used in modern operating systems. Take a look in Activity Monitor’s Memory view, and at the bottom right you’ll see how much compressed memory is in use, here currently over 1 GB. It’s used to save space in macOS, where many system files are stored in compressed formats. Compression is also an integral part of many media formats, whether still or moving images, or audio. Unless they’re compressed, we’d only be able to enjoy a small fraction of the media we now consume so voraciously.

It’s understandable that compression is thus one of the best-studied tasks in computing. Over decades of intensive research and practical experience, we have learned that the most powerful methods of compressing data can also take the most processing time. The effectiveness of compression is generally expressed in terms of compression ratio, the ratio of original file size to that of the compressed file. Thus all effective compression methods should result in a compression ratio of more than 1. You’ll also come across percentage space saving, the reduction in size compared to the original. If a compression method shrinks a 10 GB file to 5 GB, then its compression ratio is 10/5 = 2, and its space saving is (1 – 5/10) x 100 = 50%.

One of the most efficient compression techniques has been that in nanozip 0.09a, for instance, which can achieve a compression ratio of 5.3, or a space saving of 81%, on English text taken from Wikipedia. This illustrates the trade-offs made in such efficient techniques, as to achieve that it takes almost 128 seconds to compress 100 MB, a rate of 0.783 MB/s. That makes it of little everyday use and unsuitable for macOS updates, where it also couldn’t achieve the same high compression ratio because the data are binary rather than English text.

Optimising compression isn’t as simple as it might appear. There’s an inverse relationship between the time spent compressing data and the amount of compressed data to be written, as there is with many similar tasks. The more sophisticated the compression and the higher the compression ratio, the less data there is to be written out, although there’s still the same amount of data to be read. Using the example of a compression ratio of 2, 10 GB of input data has to be read and only 5 GB written out, so unless write speed is less than half read speed, reading the input data is likely to be more rate-limiting.

Compression is also representative of many other types of everyday tasks, as it consists of three sub-tasks:

read data from disk
process it (encode, decode, transcode)
write the transformed data to disk

Unless you use small file sizes, this is usually performed on data streamed from storage and back to storage, rather than on single chunks of data in memory.

To understand what limits performance of any of those everyday tasks, we need to understand how each of its sub-tasks is limited, and how those limitations interact. In some cases, such as preparation of macOS updates, time taken appears to be determined by the processing required in the second step. In the past, Apple has run all that in a single thread to minimise impact on concurrent use of other apps in the foreground. In other tasks, such as encrypting files, stage 2 processing should be fastest, making it most likely that disk access is limiting.

Although we can’t apply conclusions drawn from studying any specific compression method, we can use it as a model to develop tools that we can then apply to other tasks. Over the last week, I’ve published two articles here discussing how to do that.

In the first I asked how we can investigate whether more threads will run a task faster. We know that for my compression task they do, but that may not be true for the tasks you’re most interested in accelerating. In rare cases, an app may provide a control to set how many threads to use, and it’s easy to use that to investigate. It’s more likely that your app has no such control, but one way around that is to run it in a virtual machine, where you can pick the number of virtual cores to host that VM.

Timing the performance of your app with test files of known size should give a performance rate, and that can help identify which of those sub-tasks is probably limiting overall performance. If your app transcodes video, for example, and you time how long it takes to complete all three sub-tasks on a 10 GB video clip, you can work out its transcoding rate in GB/s. If that test clip takes 20 seconds to read that test file, transcode it, and write out the converted video, then its overall transcoding rate is 0.5 GB/s.

The next article considers how you might work out whether running that task on high-speed storage would improve that overall performance. Although it looks unlikely that a transcoding rate of 0.5 MB/s would be altered by running it on a Thunderbolt 3 disk with its 2.3 GB/s write speed, it’s worth comparing performance on the faster internal SSD to check that, and decide whether a faster USB4 SSD might be better.

By the time that you’ve worked through those tests, you should have better insight into which of those three sub-tasks is limiting performance, how you might be able to improve them, and compare current performance in the overall transcoding rate against attempts at improvement. As I wrote last week, the only way to do that is with objective measurements such as that transcoding rate.

Next week I’ll continue this series by considering how you can control the type of CPU core threads run on, and some related issues about sparse bundles and the performance of Time Machine backup storage.

Tune for Performance: Do you really need a big internal SSD?

The Eclectic Light Company

hoakley

12 December 2024 at 15:30

The most common economy that many make when specifying their next Mac is to opt for just 512 GB internal storage, and save the $/€/£600 or so it would cost to increase that to 2 TB. After all, if you can buy a 2 TB Thunderbolt 5 SSD for around two-thirds of the price, why pay more? And who needs the blistering speed of that expensive internal storage when your apps work perfectly well with a far cheaper SSD? This article considers whether that choice matters in terms of performance.

To look at this, I’m going to use the same compression task that I used in this previous article, my app Cormorant relying on the AppleArchive framework in macOS to do all the heavy lifting. As results are going to differ considerably when using other apps and other tasks, I’d like to make it clear that this can’t reach general conclusions that apply to every task and every app: your mileage will vary. My purpose here is to show how you can work out whether using a slower disk will affect your app’s performance.

Methods

In that previous article, I concluded that compression speed was unlikely to be determined by disk performance, as even with all 10 P cores running compression threads, that rate only reached 2.28 GB/s, around the same as write speed to a good Thunderbolt 3 SSD. For today’s tests, I therefore set Cormorant to use the default number of threads (all 14 on my M4 Pro) so it would run as fast as the CPU cores would allow when compressing my standard 15.517 GB IPSW test file.

I’m fortunate to have a range of SSDs to test, and here use the 2 TB internal SSD of my Mac mini M4 Pro, an external USB4 enclosure (OWC Express 1M2) with a 2 TB Samsung 990 Pro SSD, and an external 2 TB Thunderbolt 3 SSD (OWC Envoy Pro SX). As few are likely to have access to such a range, I included two disk images stored on the internal SSD, a 100 GB sparse bundle, and a 50 GB read-write disk image, to see if they could be used to model external storage. All tests used unencrypted APFS file systems.

The first step with each was to measure its write speed using Stibium. Unlike more popular benchmarking apps for the Mac, Stibium measures the speed across a wide range of file sizes, from 2 MB to 2 GB, providing more insight into performance. After those measurements, those test files were removed, the large test file copied to the volume, and compressed by Cormorant at high QoS with the default number of threads.

Disk write speeds

compressionbydisk1

Results in each test followed a familiar pattern, with rapidly increasing write speeds to a peak at a file size of about 200 MB, then a steady rate or slow decline up to the 2 GB tested. The read-write disk image was a bit more erratic, though, with high write speeds at 800 and 1000 MB.

At 2 GB file size, write speeds were:

7.69 GB/s for the internal SSD
7.35 GB/s for the sparse bundle
3.61 GB/s for the USB4 SSD
2.26 GB/s for the Thunderbolt 3 SSD
1.35 GB/s for the disk image.

Those are in accord with my many previous measurements of write speeds for those types of storage.

Compression rates

Times to compress the 15.517 GB test file ranked differently:

5.57 s for the internal SSD
5.84 s for the USB4 SSD
9.67 s for the sparse bundle
10.49 s for the Thunderbolt 3 SSD
16.87 s for the disk image.

When converted to compression rates, sparse bundle results are even more obviously an outlier, as shown in the chart below.

compressionbydisk2

There’s a roughly linear relationship between measured write speed and compression rate in the disk image, Thunderbolt 3 SSD, and USB4 SSD, and little difference between the latter and the internal SSD. These suggest that disk write speed becomes the rate-limiting factor for compression when write speed falls below about 3 GB/s, but above that faster disks make little difference to compression rate.

Poor performance of the sparse bundle was a surprise, given how close its write speeds are to those of the host internal SSD. This is probably the result of compression writing a single very large file across its 14 threads; as the sparse bundle stores file data on a large number of band files, their overhead appears to have got the better of it. I will return to look at this in more detail in the near future, as sparse bundles have become popular largely because of their perceived superior performance.

The difference in compression rates between USB4 and Thunderbolt 3 SSDs is also surprisingly large. Of course, a Mac with fewer cores to run compression threads might not show any significant difference: a base M3 chip with 4 P and 4 E cores is unlikely to achieve a compression rate much in excess of 1.5 GB/s on its internal SSD because of its limited cores, so the restricted write speed of a Thunderbolt 3 SSD may not there become the rate-limiting factor.

Conclusions

The rate-limiting step in task performance will change according to multiple factors, including the effective use of multiple threads on multiple cores, and disk performance.
There’s no simple model you can apply to assess the effects of disk performance, and tests using disk images can be misleading.
You can’t predict whether a task will be disk-bound from disk benchmark performance.
Even expensive high-performance external SSDs can result in noticeably poor task performance. Maybe that money would be better spent on a larger internal SSD after all.

Tune for Performance: do more threads run faster?

The Eclectic Light Company

hoakley

10 December 2024 at 15:30

One of the most distinctive features about modern Macs is that they have multiple cores, in Apple silicon models a minimum of eight. For an app to be able to make use of more than one core (or its equivalent) at a time it needs to divide its processing into threads, discrete blocks of code that can be run on different cores by macOS. If it doesn’t do that, then running that app on a high-end Pro, Max or Ultra chip is unlikely to be significantly faster than on a base model (assuming that task is CPU-bound).

Activity Monitor

You might be able to get a good idea as to how well an app makes use of multiple cores from watching it in use in Activity Monitor’s CPU History window, but in many cases that isn’t conclusive, and needs to be confirmed.

To illustrate ways to tackle this, I take the example of file compression. Some methods lend themselves to multiple threads better than others, and some may be implemented in a way that won’t accelerate when several cores are available. For the occasional user this might make little difference, but if you were to spend much of your day waiting for 10-100 GB files to compress, it merits a little exploration.

polycore1

This CPU History window from a compression task run on an M4 Pro is typically unhelpful. A single file compression is seen in the group of 4-5 peaks in CPU in the right half of each trace. Although compression used all ten P cores, it appears to have been moved between P clusters, seen by comparing the timing of peaks on Cores 8 and 10, 9 and 11. At no time does the task appear to exceed 50% active residency on any of the cores, though. All you can do is guess as to what might be going on, and whether it might run faster on twelve or more P cores.

polycore2

On another run, the same task is reported to have reached a peak of 500% CPU in 12 threads, but would it exceed that on more P cores, given that 10 were available?

Timing performance

As usual, I’ve been cheating a little to generate those results, by using my simple compression-decompression app Cormorant, in which I control how many threads it uses. If the app you’re trying to investigate offers a similar feature, then you can set up a standard task, here the compression of a 15.517 GB IPSW file, and time how long it takes using different numbers of threads.

polycore3

The answers from Cormorant, which conveniently performs its own timing, are:

1 thread takes 49.32 seconds
2 take 26.74 s
3 take 18.60 s
4 take 14.29 s
5 take 11.21 s.

So Cormorant’s compression can make good use of more P cores, although with diminishing returns.

Very few apps give you this level of control, though. The only other compression utility that does appear to is Keka.

polycore4

In its settings, you can give its tasks a maximum number of threads, and even run them at custom Quality of Service (QoS) if you want them to be run in the background on E cores, and not interrupt your work on P cores.

Controlling threads

There is one way that you can limit the effective number of threads used by an arbitrary app, and that’s to run it in a Virtual Machine, as you control the number of virtual cores that it uses. While you can just about run a macOS VM on a single core alone, I suggest that a more workable starting point is two cores, and you can increase that number up to the total of P cores in the host without causing problems.

VMs have other virtues, including their relative lack of background processes, allowing their virtual cores to be almost entirely devoted to your test task. However, as they can’t run apps from the App Store other than Apple’s free suite of Pages, Numbers and Keynote, that could prevent you from using them for testing.

To set up a VM for thread tests, I duplicated a standard Sonoma 14.7.1 VM so I could throw it away at the end, opened it with five virtual cores, and copied over the test app Cormorant and file. I then closed that VM down, set it to use 2 virtual cores, opened it and ran my test. I repeated that with an increasing number of virtual cores up to a total of 5. Compression times are:

2 vCPUs take 30.13 seconds
3 take 21.36 s
4 take 17.18 s
5 take 13.59 s.

Those are only slightly slower than their equivalents from the host tests above.

Analysis

polycore5

Plotting those results out using DataGraph, the lines of best fit follow power laws:

for real cores, time = 49.8863/(T^0.91169)
for virtual cores, time = 54.737/(T^0.854318)

where T is the number of threads. That explains the apparently diminishing returns with increasing numbers of threads, although the maths isn’t as simple as we might like.

A better way to look at this is by calculating the rate of compression in GB/s, simply by dividing the file size of 15.517 GB by each time. Here we end up with straight lines from linear regression, that are more amenable to thought.

polycore6

Those regressions are:

for real cores, rate of compression = 0.0464 + (0.264 x T)
for virtual cores, rate of compression = 0.102 + (0.206 x T)

which are more generally useful.

Tuning performance

This might appear over-elaborate and of little practical use, but we now have a much better understanding of the factors limiting compression performance:

The more threads compression uses, the shorter time a task will take.
There are limits to that improvement, though, when substantially more than 5 threads are used.
Compression rates achieved in M4 P cores are significantly lower than read or write speeds of the internal SSD, so aren’t likely to be limited by the speed of a faster SSD (TB3 or USB4).
As compression appears to be CPU-bound, faster P cores would also be expected to result in shorter times.
Improving the efficiency of the compression code could increase performance.
Compression in a VM runs at about 78% of speed on the host.

To see how reliable these are, I therefore repeated the Cormorant test using all 10 P cores on the host Mac, which took 6.80 seconds, a little more than half the time for 5 cores. That’s a compression rate of 2.28 GB/s, rather less than the 2.69 GB/s predicted by the linear regression. That’s now approaching the write speed of some TB3 SSDs, and the rate-limiting step could then change to be the write performance of the storage being used, rather than CPU cores.

Last Week on My Mac: Tuning for performance

The Eclectic Light Company

hoakley

8 December 2024 at 16:00

Perhaps the greatest subversion of EVs is their threat to car culture. Over more than a century, modifying and tinkering with internal combustion engines has become a popular obsession, and grown a huge industry devoted to tuning for performance. Replacing those lovingly crafted vehicles with battery-powered appliances seems too much to bear for all those enthusiasts.

Half the Mac articles posted here last week are about improving performance, whether in the CPU cores of the M4 or when connecting to external displays and storage. In comments we compared predictions, benchmarks and experience in our quest for improvement. Most telling, though, are those who report little improvement in the software they use to earn their livelihood, products that have been central to Macs for decades, such as Adobe Photoshop since 19 February 1990. If all that engineering effort has had so little effect, there’s something seriously amiss.

There’s much to be transferred from our experience of tuning vehicles to improving the performance of our apps. First is the appreciation that benchtests don’t necessarily translate into what happens on the road. Second is the need for objective measurements to assess performance relevant to our aims. Third is use of a systematic approach to improvement, in recognising where bottlenecks or constraints are, and addressing them methodically.

Benchtests

When each new Apple silicon chip becomes accessible, there’s a race to post its first Geekbench results and thereby demonstrate how performant it is. YouTube is now filling up with demonstrations of impressive or disappointing figures shown on the dials of Blackmagic speed tests on Thunderbolt 5 devices. It’s significant here that the Blackmagic test displays analogue meters taken from those on traditional car dashboards.

Few ever drill down and ask what those numbers returned by Geekbench mean, nor the relevance of disk speed tests to situations where transfer to or from storage limits app performance. Although Primate Labs provide details of the compendium of tests used to calculate Geekbench scores, it’s impossible to know how those might compare with code run by the apps we use. A Mac with impressive benchmark scores can still be dog-slow when running our daily tasks.

Blackmagic Disk Speed Test appears to report write and read speeds measured for a single file size of 5 GB, but tells you little about how storage might cope with large numbers of smaller files, or those much larger. As results vary between tests, it needs some method of providing the best estimate for a series of results, and a measure of the confidence in that number. As a quick, fun method of checking whether storage is up to the task of recording and playing back video files, it’s ideal, but it’s neither intended nor suitable for more general purposes.

Objective measurements

Subjective assessments are widely known for their power to mislead. If you ever want the thrill of your life, ride a recumbent trike at anything over 30 miles an hour down a steep and winding hill. Because your bum and eyes are so much closer to the road surface it feels like three times that speed in a car. You can see similar effects in non-linear progress bars. When they’re slow to start with and accelerate rapidly from midway on, they appear quicker than the reverse, with an apparently interminable wait for the last ten percent to complete.

My heart sinks when someone tells me that something is slow without being specific about what that something is, and providing numbers to support that impression. The only objective assessments are quantitative, in numbers rather than feelings.

But those numbers must also measure something meaningful. In the past we’ve tried simply timing how long a Mac takes to start up, or to launch an app, as if that’s what we spend all day waiting for. When you only launch an app once a day and restart your Mac every couple of weeks, who cares whether those take a few seconds more or less?

Ideally, you need to identify tasks that you have to wait for repeatedly, with discrete instants at the start and end that are separated by several seconds at least. Those can then be timed fairly accurately and reproducibly to form your own task-specific performance benchmark. Then when you come to the stage of testing out the effects of different settings and hardware, you can compare those times. This could become a bit more technical; when assessing the performance of the macOS Unified log, for instance, I’ve calculated the time in nanoseconds between log entries, but that level of precision is rarely needed.

Identification

Armed with comparisons of the time to perform relevant key tasks, we must then analyse what’s involved in each and determine which step is rate-limiting. Most tasks involve several stages, perhaps starting with reading of data from storage into memory, then processing that and displaying an outcome. Those in turn depend on effective read speed, memory capacity and access time, and a host of operations in CPU cores, GPU and possibly other units in the chip.

The key to understanding performance is Activity Monitor, in its different views. Developers also use Xcode’s valuable collection of Instruments, there are also Signposts widely used in the log, and if you really want to hone in on what’s happening in processor cores there’s always powermetrics. But well before you risk getting lost in those weeds, observations in Activity Monitor should provide a good picture of what’s going on, and where delays are occurring.

Activity Monitor does have its pitfalls, as with any other method of investigation. Perhaps its greatest shortcoming on Apple silicon Macs is in not displaying or taking into account frequencies of the two types of CPU core, but at least its CPU History window does show which type of core is bearing the brunt.

Goal

Behind all this is the need for a more critical approach when tuning our Macs for better performance. Fitting a new exhaust system to a car might make it sound good, but unless you go to the trouble of tuning it for performance, it might actually be slower on the road, all bark and no bite. Unlike Apple’s devices, our Macs haven’t become appliances, and in coming articles I’ll explore how you can tune yours.

Last Week on My Mac: Mac mini M4 Pro first impressions, cores and more

The Eclectic Light Company

hoakley

10 November 2024 at 16:00

My Mac mini M4 Pro arrived four days early, and swiftly exceeded all expectations. Everyone who has seen this new design remarks on how tiny it is, and I still keep looking at it, wondering how the smallest Mac I have ever owned is also by far the fastest. For those who collect numbers, it matched the growing collection in Geekbench’s database, with a single-core score of 3,892, a multi-core of 22,706, and Metal GPU of 110,960. Of course those new MacBook Pros with M4 Max chips are doing better with their extra two CPU cores and twice the number of GPU cores, but second best is still far superior to anything I’ve run before.

Not only is this my fastest Mac ever, but it was also the quickest and simplest to commission. It replaces my Mac Studio M1 Max, and now sits under my Studio Display using the Studio’s old keyboard and trackpad. As I’ll explain in a future article, I opted to migrate to the mini during initial setup using the Studio’s backup SSD, a sleek OWC Envoy Pro SX connected by Thunderbolt 3. Including the inevitable macOS update, the whole process took less than two hours, from the start of unboxing to the first full Time Machine backup.

With 48 GB memory and an internal SSD of 2 TB, I was keen to benchmark that once its initial backup was done. Although I have reduced confidence in these figures, there’s no doubt that this SSD is significantly quicker than the 2 TB in the Studio; how much quicker is open to debate. AmorphousDiskMark reported a write speed of 7.7 GB/s, while my own Stibium gave 10.3 GB/s across a broad range of file sizes. There was closer agreement on read speed, at around 6.8 GB/s. Both of those are comfortably above what I expect when I can finally get hold of an external Thunderbolt 5 SSD.

CPU cores

My M4 Pro is the full-performance variant, with 10 P and 4 E cores, set in three clusters: an E cluster with all four E cores, and two clusters of five P cores. Since its M3 chips, Apple has increased the maximum number of cores in a cluster from 4 (in M1 and M2 chips) to 6 (M3 and M4).

M4 P cores run at frequencies between 1260 and 4512 MHz (1.3-4.5 GHz), their maximum frequency being 111% that of the M3 P core, and 140% that of the M1 P core. M4 E cores have a narrower range of frequencies than those in the M3, between 1020-2592 MHz (1.0-2.6 GHz), so running at 137% of M3 frequency when running low QoS threads, and 94% of M3 when running at maximum frequency for high QoS threads spilt over from P cores. Those are going to make comparisons between E core performance very interesting indeed.

Single-thread and single-core comparisons of P core throughput are inevitably in the M4’s favour, with significantly better performance across integer, floating point and vector performance, as shown in the chart below.

M4M3multiTests

The Y axis here gives loop throughput per second for my four basic in-core performance tests, a tight assembly code integer math loop, another tight assembly code loop of floating point math, NEON vector processor assembly code, and a tight loop calling an Accelerate routine run in the NEON unit. Pale blue bars are results for the M1, purple for the M3, and red for the shiny new M4.

Although improvements in integer and floating point performance might appear small here, these results are for single threads. When scaled up across more P cores, the differences are magnified, as shown in the regressions below.

M4vM3floatTimevThreads

This plots total times taken to execute multiple threads, each consisting of 10^9 (one thousand million, or one billion) loops of floating point assembly code, against the number of threads, the same as the number of cores used, as each core runs one thread. According to the fitted linear regression equations, each thread/core of 10^9 loops takes the same period: 6.04 seconds for the M3 P cores, and 4.69 for the M4. Those equate to 1.66 x 10^8 loops/second per thread for the M3, and 2.13 x 10^8 for the M4. On that basis, the M4 performs at 128% that of the M3, significantly better than expected by the 111% increase in maximum core frequency. That requires the M4 to have improved processing speed for the same frequency of the M3, a frequency-independent improvement.

Core allocation

One striking difference between M4 CPU cores and those of previous Apple silicon chips is their pattern of core allocation, something you may notice even in Activity Monitor, for all its faults. This is best illustrated in its CPU History window.

For these two series of tests, I waited until the Mac was idling quietly, with little happening on its CPU cores. I then ran, in rapid succession, a series of my in-core floating point tests of 10^9 loops at high QoS, starting with a single thread, then two, and so on up to six or eight threads. This results in a pattern distinct from anything you’ll see in Intel cores, as I have shown here since the early days of the M1.

m3profloptcompo

This is the series seen on my MacBook Pro M3 Pro running macOS 15.2 beta, and is typical of M1, M2 and M3 chips. Although the cores are out of order here, these are the six cores in the chip’s single P cluster. At the left is the obvious peak from 1 thread running on Core 9, then the 2 threads of the next test appear on cores 9 and 12. Three fully occupy 7, 9 and 12, with a little spilt onto 8, and so on until all six are fully occupied with 6 threads.

Activity Monitor is a little too crude to see how distinct this is, for which you have to resort to shorter sampling periods of 0.1 second, and the finer detail of active residency reported by powermetrics. Those confirm that, much of the time, these heavy in-core compute loads run in a single thread on a single core, and if they are moved around, it’s relatively infrequently. E cores are different, though.

Compare that with a similar sequence, this time for 1-8 threads, on the M4 Pro. Its ten P cores are divided into two clusters, which I’ve separated here with the blue line, although within each cluster the cores aren’t in numeric order.

m4profloptcompo

Reading again from the left, a single thread is run in cores 6 and 8, with a little in 14 in the second cluster. Two threads are run in all five cores of the second cluster, with a little at the end from those of the first cluster. Three threads are similar, with significant contributions from all ten cores, and from then on they are similar.

So, in the M4 P cores threads are no longer allocated to single cores, and it appears from the CPU History window that they may even be allocated across more than one cluster. In fact, when the detailed active residency and frequency data from powermetrics are used, while threads are often moved between two cores in the same cluster, macOS will normally avoid unnecessarily running threads on other clusters, although it will happily move all threads from one cluster to another.

The rationale behind running all threads within the same cluster is clear, as CPU core frequencies of all cores in the same cluster are the same. Constraining threads to a single cluster thus allows others to idle and use less energy. While that continues in M4 chips, it isn’t clear why threads are moved between cores so frequently, or moved together to another cluster, and whether that brings improvements in performance.