Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

Last Week on My Mac: Tuning for performance

By: hoakley
8 December 2024 at 16:00

Perhaps the greatest subversion of EVs is their threat to car culture. Over more than a century, modifying and tinkering with internal combustion engines has become a popular obsession, and grown a huge industry devoted to tuning for performance. Replacing those lovingly crafted vehicles with battery-powered appliances seems too much to bear for all those enthusiasts.

Half the Mac articles posted here last week are about improving performance, whether in the CPU cores of the M4 or when connecting to external displays and storage. In comments we compared predictions, benchmarks and experience in our quest for improvement. Most telling, though, are those who report little improvement in the software they use to earn their livelihood, products that have been central to Macs for decades, such as Adobe Photoshop since 19 February 1990. If all that engineering effort has had so little effect, there’s something seriously amiss.

There’s much to be transferred from our experience of tuning vehicles to improving the performance of our apps. First is the appreciation that benchtests don’t necessarily translate into what happens on the road. Second is the need for objective measurements to assess performance relevant to our aims. Third is use of a systematic approach to improvement, in recognising where bottlenecks or constraints are, and addressing them methodically.

Benchtests

When each new Apple silicon chip becomes accessible, there’s a race to post its first Geekbench results and thereby demonstrate how performant it is. YouTube is now filling up with demonstrations of impressive or disappointing figures shown on the dials of Blackmagic speed tests on Thunderbolt 5 devices. It’s significant here that the Blackmagic test displays analogue meters taken from those on traditional car dashboards.

Few ever drill down and ask what those numbers returned by Geekbench mean, nor the relevance of disk speed tests to situations where transfer to or from storage limits app performance. Although Primate Labs provide details of the compendium of tests used to calculate Geekbench scores, it’s impossible to know how those might compare with code run by the apps we use. A Mac with impressive benchmark scores can still be dog-slow when running our daily tasks.

Blackmagic Disk Speed Test appears to report write and read speeds measured for a single file size of 5 GB, but tells you little about how storage might cope with large numbers of smaller files, or those much larger. As results vary between tests, it needs some method of providing the best estimate for a series of results, and a measure of the confidence in that number. As a quick, fun method of checking whether storage is up to the task of recording and playing back video files, it’s ideal, but it’s neither intended nor suitable for more general purposes.

Objective measurements

Subjective assessments are widely known for their power to mislead. If you ever want the thrill of your life, ride a recumbent trike at anything over 30 miles an hour down a steep and winding hill. Because your bum and eyes are so much closer to the road surface it feels like three times that speed in a car. You can see similar effects in non-linear progress bars. When they’re slow to start with and accelerate rapidly from midway on, they appear quicker than the reverse, with an apparently interminable wait for the last ten percent to complete.

My heart sinks when someone tells me that something is slow without being specific about what that something is, and providing numbers to support that impression. The only objective assessments are quantitative, in numbers rather than feelings.

But those numbers must also measure something meaningful. In the past we’ve tried simply timing how long a Mac takes to start up, or to launch an app, as if that’s what we spend all day waiting for. When you only launch an app once a day and restart your Mac every couple of weeks, who cares whether those take a few seconds more or less?

Ideally, you need to identify tasks that you have to wait for repeatedly, with discrete instants at the start and end that are separated by several seconds at least. Those can then be timed fairly accurately and reproducibly to form your own task-specific performance benchmark. Then when you come to the stage of testing out the effects of different settings and hardware, you can compare those times. This could become a bit more technical; when assessing the performance of the macOS Unified log, for instance, I’ve calculated the time in nanoseconds between log entries, but that level of precision is rarely needed.

Identification

Armed with comparisons of the time to perform relevant key tasks, we must then analyse what’s involved in each and determine which step is rate-limiting. Most tasks involve several stages, perhaps starting with reading of data from storage into memory, then processing that and displaying an outcome. Those in turn depend on effective read speed, memory capacity and access time, and a host of operations in CPU cores, GPU and possibly other units in the chip.

The key to understanding performance is Activity Monitor, in its different views. Developers also use Xcode’s valuable collection of Instruments, there are also Signposts widely used in the log, and if you really want to hone in on what’s happening in processor cores there’s always powermetrics. But well before you risk getting lost in those weeds, observations in Activity Monitor should provide a good picture of what’s going on, and where delays are occurring.

Activity Monitor does have its pitfalls, as with any other method of investigation. Perhaps its greatest shortcoming on Apple silicon Macs is in not displaying or taking into account frequencies of the two types of CPU core, but at least its CPU History window does show which type of core is bearing the brunt.

Goal

Behind all this is the need for a more critical approach when tuning our Macs for better performance. Fitting a new exhaust system to a car might make it sound good, but unless you go to the trouble of tuning it for performance, it might actually be slower on the road, all bark and no bite. Unlike Apple’s devices, our Macs haven’t become appliances, and in coming articles I’ll explore how you can tune yours.

Inside M4 chips: Controlling frequency

By: hoakley
2 December 2024 at 15:30

To realise best performance and energy efficiency from the big.LITTLE architecture in Apple’s M-series chips requires careful management on the part of macOS. There’s much more to it than balancing loads over conventional multi-core CPUs with a single type of core, as each execution thread needs to be run in an optimal location. When deciding where to run a CPU thread, macOS controls:

  • which type of core, P or E, primarily determined by the thread’s Quality of Service (QoS), and core availability;
  • which cluster to run it in, for chips with more than one cluster of that type, set to try to keep as few clusters active as necessary;
  • which core within that cluster, determined by core availability, and semi-randomised to even out core use;
  • what frequency to run that cluster at, in turn depending on the core type and the thread’s QoS;
  • mobility of that thread between cores in the same cluster, and between clusters (when available).

Over the last four years, I have explored the rules apparently used for the first two, and the choice of frequency in E cores. This article looks in more detail at how the frequency of P clusters appears to be determined in M1, M3 and particularly M4 chips.

powermetrics provides more frequency figures than you know what to do with, although most are derived and to some extent imaginary, making reconciliation difficult. In tests reported in the previous article, I used those given as Cluster HW active frequency to demonstrate distinctive patterns seen on M4 P cores running different numbers of test threads.

m4frequenciesByThreads

This graph shows those frequencies for the active P cluster by the number of threads, for floating point, NEON and vDSP_mmul tests detailed previously. Frequencies for the first two tests are identical, at P core maximum for a single thread, then falling sharply from 2 to 3 threads. When more threads are run, a cluster that’s fully active is run at the same frequency as that for 5 threads (P cluster size on this M4 Pro), while the other P cluster follows the same frequencies shown in the graph for the number of threads it’s running.

To examine this further I first climbed a mountain.

Climbing a mountain

For this test I used three copies of my test app to run a total of three identical threads of my in-core floating point test code in a mountain pattern. I first started powermetrics gathering data, then launched the first thread, followed by the second, and then the third. My objective was to observe an initial period when just one test thread was running, a second with the second test thread in addition, a third when all three threads would be running, and then watch the sequence reverse as each thread ended. This is shown in the results below.

m43appflopt1

This chart shows active residencies by core and cluster for the P cores in an M4 Pro, with 5 P cores in each of its two P clusters, during this test. For the first 15 sample periods (1.5 seconds), a single test thread is moved around between cores in the second P cluster (P1). That’s joined by the second thread run on another core in the same cluster, until sample 30, when the third thread is added, pushing the total active residency to 300%.

At that point, all three threads are moved to three cores in the first P cluster (P0), whose bars are shown in blues and green. The first thread completes in sample 37, leaving two threads with 200% active residency to continue in that cluster until sample 50, when the second thread completes, leaving just one running. In sample 54 (5.4 seconds after test start), that one remaining thread is moved back to complete on core P11 in the second cluster late in sample 63.

In that period of 6.3 seconds, each of the two P clusters has run 1, 2 and 3 threads.

m43appflopt2

This graph shows cluster frequencies over the same period, this time given in seconds elapsed rather than sample number. The red line and points show the frequency of cluster P1, and blue for P0. Those undergo step changes when each cluster is running test threads. The inactive cluster is normally shut down with a frequency of 0 MHz, although there are some brief spikes from that as well.

m43appflopt3

Combining active residency bars in yellow with core frequency lines, it’s clear that cluster frequency is close to core maximum at 4,500 MHz when only a single thread is running. With two threads, it’s reduced to 4,400 MHz, and down to 3,900 MHz when all three threads are running. Those changes are symmetrical for loading and unloading clusters, and shown no signs of hysteresis (different values during loading and unloading).

Closer examination gives frequencies of 4,511 MHz at 100% active residency, 4,415 MHz at 200%, and 3,924 MHz at 300%e. The latter is 87% of maximum frequency, a large enough reduction to be reflected in performance. Essentially identical figures are found for NEON tests as well as these for floating point.

Although this test method can give highly reproducible results, the floating point and NEON tests used don’t resemble threads seen in everyday use. The next step is to extend that by looking at thread numbers and frequency when running more normal code.

Compressing a file

Fortunately, I have already built a suitable platform for real-world testing in a one-trick pony named Cormorant, a basic compression-decompression utility using Apple Archive. Although not a patch on serious apps like Keka, Cormorant can set the number and QoS of threads to be run during compression/decompression. Because it relies on Apple’s framework, it actually runs more than just the threads set in its controls, but still provides a way to control active residency. I therefore ran a test compression (15.5 GB IPSW image file) at maximum QoS to ensure it’s dispatched to P cores, and 1-3 threads.

Time taken to compress the test file changes greatly according to the number of threads used:

  • 1 thread takes 49.6 s, at 313 MB/s;
  • 2 threads takes 26.8 s, at 578 MB/s, 191% of the throughput of the single thread;
  • 3 threads takes 18.7 s, at 829 MB/s, 265% of the single thread.

These appear to follow the pattern of frequencies observed on my in-core tests.

m4corm1thread1

This chart shows the opening 3 seconds of single-thread compression, with cluster frequency in the points and line, and total active residency multiplied by 10 (to scale to a common y axis) in pale blue bars. Two significant periods are shown: in samples 12-21, active residency is high, between 300-430%, and frequency is lower at around 4,000 MHz. Following that, active residency falls to about 200% and frequency rises to 4,200 MHz.

Because active residency was so variable in these tests, I pooled paired values for that and cluster frequency, gathered over 3 second periods, and plotted those.

m4corm1thread2

Although at active residencies below about 180% there’s a wide scatter, above that there’s a good linear regression, showing steady decline in frequency over active residencies ranging from 180% to 450%.

The following two graphs show equivalents for tests using 2 and 3 threads. The first of those has two outliers at total active residencies above 490%, corresponding to unusual conditions during the test. I have therefore excluded those from subsequent analyses.

m4corm2thread1

m4corm3thread1

The last step is to pool paired results from all three test conditions, and arrive at a line of best fit.

m4corm1-3Poolthread1

Between total cluster residencies of 150-500%, this works best with a quadratic curve with the equation
F = 4630.05 – (2.0899 x R) + (0.0010107 x R^2)
from which a different relationship is predicted between F, frequency in MHz, and R, total active residency in %.

My other real-world test makes use of the fact that, when virtualising macOS, the number of virtual cores on the host is specified.

Hosting virtual cores

Although virtualisation relies on frameworks run on the host, experience is that its demand on host P cores is constrained to the number of virtual cores allocated, with each of those resulting in 100% active residency, equating to the whole of a P core on the host. powermetrics started collecting sample periods immediately before a macOS 14 VM was launched, and the first 3 seconds (30 samples) were collected and analysed for VMs with 1-3 virtual cores.

m4vm1thread1

This shows the first of those, a VM allocated just a single virtual core, with cluster frequencies shown as red and blue lines, and total active residency multiplied by 10 in the pale blue bars. With a steady total active residency of 100%, active cluster frequency was about 4,500 MHz. Note that sample 7 included transfer of the threads from P1 to P0 in a sharp peak to a total active residency of over 500%.

Average frequencies can thus be calculated for each of the three tests, at 100-300% active residencies.

Set frequencies

I now have estimates of cluster frequencies for cluster total active residencies from:

  • in-core tests using floating point
  • in-core tests using NEON
  • compression
  • virtualisation

against which I compare a matrix multiplication test that may be run on shared matrix co-processors (AMX). These are shown in the table below.

m4pclusterfreq

Running a single thread in a cluster should result in a total active residency of 100%, for which macOS sets the cluster frequency at P core maximum, of 4,400-4,511 MHz. That for 200% is lower, at between 4,000-4,400 MHz, and falls off further to about 3,800 when all 5 cores are at 100% active residency. Frequencies set for the vDSP_mmul test are significantly lower throughout, supporting the proposal that test isn’t being run conventionally in P cores, but in a co-processor.

A sixth thread would then be loaded onto the other P cluster, where cluster frequency would be set at P core maximum again, progressively reducing with additional threads until that cluster was also running at about 3,800 MHz.

Following this, I returned to the tests I have performed over the last four years on M1 and M3 P cores. Although I haven’t analysed those formally, I now believe that their frequencies are controlled by macOS as follows:

  • M1 1 core at 3,228 MHz, 2 cores 3,132 MHz, 3-6 cores 3,036 MHz.
  • M3 1 core at 3,624 MHz (below maximum of 4,056), 2-6 cores 3,576 MHz.

The range of frequencies in the M1 and M3 is narrower, resulting in less difference in performance between single- and multi-core tests. However, the M4 falls to 87% maximum frequency at 3 threads and more, which is substantial. It’s worth noting that Geekbench single-core results for the M4 are around 3,892 and would scale up to a multi-core result of 38,920 on an M4 Pro with 10 P cores, whereas the actual multi-core score is about 22,700, 58% of the scaled value. Although the effects of lower frequency can’t account for all that difference, they must surely contribute to it.

Why?

Two plausible contenders for the reason that macOS reduces P cluster frequency with increasing active residency are for thermal management, hence reliability, and when competing for a limited resource, perhaps the L2 cache shared within each cluster.

Reductions in cluster frequency seen here isn’t thermal throttling, though. Tests were intentionally kept brief in order to accommodate their results in reasonably short series of powermetrics results. Power use was highest in the NEON and vDSP_mmul tests, and lowest in floating point, although there don’t appear to be matching differences in frequency control. As noted in the previous tests, High Power mode didn’t alter frequency control, although frequencies were reduced in Low Power mode.

It’s most likely that this frequency regulation is pre-emptive, and based not just on CPU cores, but allows for likely heat output in the rest of the Mac.

Key information

  • When running on Apple silicon Macs, macOS modulates ‘cluster HW active frequency’ of P cores, limiting frequency to below maximum when cluster total active residency exceeds 100%.
  • Although small in M1 variants, this is most prominent in M4 variants, where a total active residency of 300% may reduce cluster frequency to 87% of maximum.
  • Frequency limitation is most probably part of a pre-emptive strategy in thermal management.
  • Frequency limitation is at least partly responsible for non-linear changes in performance with increasing recruitment of P cores, as illustrated in single- and multi-core benchmarks.
  • Control of P cores by macOS is complex, particularly in M4 variants.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Inside M4 chips: Matrix processing and Power Modes

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Acknowledgements

Several of you have contributed to discussions here, but Maynard Handley has for several years provided sage advice, challenging discussion, and his personal mine of information. Thank you all.

Inside M4 chips: Matrix processing and Power Modes

By: hoakley
27 November 2024 at 15:30

For much of the last four years, CPU and GPU core performance have been of primary importance to those using Apple silicon Macs. Despite that, special interest has developed in those cores that Apple doesn’t talk about, in its matrix co-processor, AMX. Since the first M1 it has been believed that each CPU core cluster has its own AMX, and more recently it has been shown to be capable of impressive performance. With the increasing prominence of computationally intensive features including AI, this is growing in importance.

When the M4 first became available in iPads earlier this year, researchers concluded its chip has a new AMX co-processor that can now be programmed using SME matrix extensions supported by the new version of the Arm instruction set in the M4, ARMv9.2-A. The best accounts of that challenging work are given on the Friedrich Schiller University’s site Hello SME, and Taras Zakharko’s GitHub.

From early work by Dougall Johnson on the M1, it has been known that some of the functions in Apple’s vast Accelerate maths libraries can run code on the AMX. Thanks to the guidance of Maynard Handley, a year ago I concluded that one of those is the vDSP_mmul function in the vDSP sub-library. This article reports tests of that function in a Mac mini M4 Pro running Sequoia 15.1.1, leads on to an explanation of previous results using floating point and NEON tests, and considers the effects of Power Modes.

Methods

I used the same methods as described previously. These consist of running many tight loops of test code designed to be confined as much as possible to register access. The test used here uses vDSP_mmul, multiplying two 16 x 16 32-bit floating point matrices, and consists of 15 x 10^6 such loops. Its source code is given in the Appendix at the end. During each test run, the command tool powermetrics gathers core power and performance data in sampling periods of 0.1 second for a total of 50 samples. Its output is piped into a text file, which is then analysed using Numbers and DataGraph.

For comparison, I also show here my results from equivalent tests using assembly code for the NEON vector processor, as given in that previous article.

Power used by thread

The first graphs show average power use during each test with increasing numbers of threads, where error bars indicate a spread of +1 standard deviation.

m4powervdsp1

The linear relationship for 1-10 threads, run on 1-10 P cores, is a better fit for vDSP_mmul, shown above, than NEON below. Run on P cores, each vDSP_mmul thread used about 3.6 W, significantly greater than NEON at 3.0 W. However, when those high QoS threads spilt over onto E cores, that relationship for vDSP_mmul broke down, leaving the highest power use that for 10 threads, one for each P core available in the M4 Pro chip used.

There’s no evidence of the steps seen below in NEON at 2-3 and 7-8 threads.

m4powerneon1

Execution time

m4powervdsp2

Total execution time has a strongly linear relationship too, for vDSP_mmul at about 2.6 seconds per thread. Again, that relationship broke down once test threads had spilt over onto E cores, unlike in the graph below for NEON.

m4powerneon2

m4powervdsp3

There are also obvious differences in the time required to execute each thread. vDSP_mmul (above) showed a linear increase from 1-5 threads, followed by a constant time for 5-10 threads. Once threads were running on E cores, the relationship became far looser, as seen in the red regression line. In NEON below, the relationship was weakest in the 1-5 thread range, and closer on the E cores from 11-14 threads.

m4powerneon3

Energy use

m4powervdsp4

When run on P cores, there’s a good line of fit indicating energy use of 9.2 J per thread (above), compared with NEON (below) at 7.7 J. The red line of best fit for the E core section from 11-14 cores suggests E core energy use of about 4.8 J per thread, again higher than NEON, which shows a much better fit and only about 3 J per thread.

m4powerneon4

Maximum total energy use estimated for vDSP_mmul was just over 140 J, while for NEON it was only about 90 J.

The overall picture of vDSP_mmul is thus different from those seen in floating point and NEON tests. When run on P cores alone, vDSP_mmul behaves more linearly, using significantly more power and energy. Once running threads on E cores, though, that breaks down and performance falls, rather than simply slowing.

The role of frequency

There has long been a tacit assumption that, when running on P cores, computationally intensive threads such as those used in these tests are run at a fairly constant frequency close to maximum. Looking back at my earlier results on M1 and M3 cores, though, frequencies aren’t so consistent, and in many cases not that close to maximum either.

powermetrics provides more frequency figures than you know what to do with, although most are derived and to some extent imaginary, making reconciliation difficult. Taking the best estimate of core frequency as that given as Cluster HW active frequency, patterns seen on M4 P cores are distinctive.

m4frequenciesByThreads

This graph shows those frequencies for the active P cluster by the number of threads, for floating point, NEON and vDSP_mmul tests. Frequencies for the first two are identical, at P core maximum for a single thread, then falling sharply from 2 to 3 threads. When more threads are run, the cluster that’s fully active is run at the same frequency as that for 5 threads (P cluster size on this M4 Pro), while the other P cluster follows the same frequencies shown in the graph for the number of threads it’s running.

Frequencies are controlled by macOS, and this suggests it adopts a standard pattern when running the two in-core tests, and a different one for vDSP_mmul presumably geared to performance of the AMX. Changes in frequency also account for the steps seen at 2-3 and 7-8 (= 5+2 and 5+3) threads in the NEON graphs above.

Power Modes

There has been considerable interest in the Power Mode setting available in macOS when running on M4 Pro and Max chips. To assess its effects I ran tests in 10 threads to fully occupy the P cores, at each of the three Power mdes.

There was no difference between results for the default Automatic and High Power modes, as expected. This is because the effect of High Power mode isn’t to change frequencies or power use in the short-term, but by more aggressive fan use enabling higher frequencies to be sustained for prolonged periods, when in Automatic mode they would be throttled.

Low Power does have substantial effects on core frequency, performance and power use, though. When running floating point tests in 10 threads, their cluster frequency was reduced from 3,852 to 3,624, 94% of Automatic and High Power. That reduced power use from a mean of 13.9 W to 11.2 W, and increased the time to complete threads. Time taken by floating point threads increased to 106% of Automatic and High, while that for NEON increased to 135% and vDSP_mmul to 177%. While the reduced performance for floating point threads is unlikely to be noticeable, for vector and matrix threads that’s likely to obvious to the user.

Key information

  • vDSP_mmul matrix multiplication from the vDSP sub-library in Accelerate behaves consistently with it being performed in the AMX co-processors in M4 Pro chips.
  • vDSP_mmul threads used significantly more power than NEON, reaching a maximum of just over 36 W when fully occupying all 10 P cores.
  • When spilt over to E cores, vDSP_mmul threads were much slower and their performance erratic, consistent with the E cluster having a smaller and significantly less performant AMX.
  • In-core tests (floating point and NEON) show common frequency regulation according to the number of cores active in each P core cluster. This runs a single thread at maximum frequency, then reduces sharply from 2 to 3 threads/cores. This accounts for the deviations from linearity observed in power use and performance. That pattern doesn’t appear in vDSP_mmul threads, though.
  • High Power and Automatic modes are identical in short-term tests.
  • Low Power mode reduces P cluster frequency and power use. Although its effects are unlikely to be noticeable in floating point threads, effects on vector and matrix threads are greater, and performance reductions are likely to be obvious to the user.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Finding and evaluating AMX co-processors in Apple silicon chips (M1 and M3)

Appendix: Source code

16 x 16 32-bit floating point matrix multiplication

var theCount: Float = 0.0
let A = [Float](repeating: 1.234, count: 256)
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)
let IC: vDSP_Stride = 1
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
B.withUnsafeBufferPointer { Bptr in
C.withUnsafeMutableBufferPointer { Cptr in
for _ in 1...theReps {
vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB, Cptr.baseAddress!, IC, M, N, P)
theCount += 1
} } } }
return theCount

Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision.” “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”

❌
❌