Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

A brief history of Activity Monitor

By: hoakley
14 December 2024 at 16:00

In the days of Classic Macs, most of our concerns centred on memory rather than CPU, disk or network performance. Tools for managing memory flourished, and I don’t recall any utility provided in Mac OS that looked more broadly. That changed when Mac OS X arrived, and we started taking an interest in Process Viewer and CPU Monitor.

cpumonitor2001

CPU Monitor, seen here in 2001 with its two floating chart views, was a first step in development. I suspect this was taken on my QuickSilver Power Mac G4 with its dual processors, hence the pair of CPU charts.

processviewer2001

Process Viewer is a more obvious ancestor of the modern Activity Monitor, with its list of processes in a column view. Note the surprisingly few system processes running at the time: only 34, compared with today’s list of several hundred.

activitymonitor2005

Activity Monitor integrated those features into its new app in Mac OS X 10.3 Panther in 2003, and is shown here in 10.4 Tiger from a couple of years later, where it’s surprisingly similar to the current app. There are still only 72 processes running, though, and most have less than 5 threads.

activitymonitor2011

Here’s Activity Monitor with its CPU History window on an 8-core Mac Pro, running Mac OS X 10.7 Lion in 2011. The number of processes is still only growing slowly, and had reached 89.

instruments2012a

Xcode’s Instruments added an Activity Monitor template for developers wanting to monitor further details as they tested and debugged their code. At this stage it was limited to presenting the same fundamental measurements as were available in Activity Monitor, but these have since flourished into much greater depth.

instruments2012b

Two years later, in 2013 for OS X 10.9 Mavericks, Activity Monitor underwent revision to add a new tab for Energy. Unlike other panes, the ‘Energy Impact’ is given in arbitrary units rather than calculating Joules from power use over time. It was at about this time that the app also added a Cache pane analysing the performance of the Content Caching service, when enabled on the Mac.

activitymonitor2020i

Until 2020, Activity Monitor had dealt with CPUs with uniform cores. Above are the eight physical and eight Hyper-threaded cores in an 8-core Intel Xeon W in an iMac Pro from 2017, running a heavy load of over 700% CPU. With the first Apple silicon Macs, it had to display CPU use for two different types of core. Note how, by 2020, the total number of processes has shot up to 458.

activitymonitor2021m1

This is an example from one of the first base M1 Macs, with 4 Efficiency and 4 Performance cores displayed neatly in its CPU History window. Although Activity Monitor doesn’t take core frequency into account when measuring % CPU, and can’t display cluster frequencies, it remains one of the essential tools for everyone, whichever age and architecture their Mac.

Tune for Performance: Activity Monitor’s CPU view

By: hoakley
11 December 2024 at 15:30

Activity Monitor is one of the unsung heroes of macOS. If you understand how to interpret its rich streams of measurements, it’s thoroughly reliable and has few vices. When tuning for performance, it should be your starting point, a first resort before you dig deeper with more specialist tools such as Xcode’s Instruments or powermetrics.

How Activity Monitor works

Like Xcode’s Instruments and powermetrics, the CPU view samples CPU and GPU activity over brief periods of time, displays results for the last sampling period, and updates those every 1-5 seconds. You can change the sampling period used in the Update Frequency section of the View menu, and that should normally be set to Very Often, for a period of 1 second.

This was normally adequate for many purposes with older M-series chips, but thread mobility in M4 chips can be expected to move threads from core to core, and between whole clusters, at a frequency similar to available sampling periods, and that loses detail as to what’s going on in cores and clusters, and can sometimes give the false impression that a single thread is running simultaneously on multiple cores.

However, sampling frequency also determines how much % CPU is taken by Activity Monitor itself. While periods of 0.1 second and less are feasible with powermetrics, in Activity Monitor they would start to affect its results. If you need to see finer details, then you’ll need to use Xcode Instruments or powermetrics instead.

% CPU

Central to the CPU view, and its use in tuning for performance, is what Activity Monitor refers to as % CPU, which it defines as the “percentage of CPU capability that’s being used by processes”. As far as I can tell, this is essentially the same as active residency in powermetrics, and it’s vital to understand its strengths and shortcomings.

Take a CPU core that’s running at 1 GHz. Every second it ‘ticks’ forward one billion times. If an instruction were to take just one clock cycle, then it could execute a billion of those every second. In any given second, that core is likely to spend some time idle and not executing any instructions. If it were to execute half a billion instructions in any given second, and spend the other half of the time idle, then it has an idle residency of 50% and an active residency of 50%, and that would be represented by Activity Monitor as 50% CPU. So a CPU core that’s fully occupied executing instructions, and doesn’t idle at all, has an active residency of 100%.

To arrive at the % CPU figures shown in Activity Monitor, the active residency of all the CPU cores is added together. If your Mac has four P and four E cores and they’re all fully occupied with 100% active residency each, then the total % CPU shown will be 800%.

There are two situations where this can be misleading if you’re not careful. Intel CPUs feature Hyper-threading, where each physical core acquires a second, virtual core that can also run at another 100% active residency. In the CPU History window those are shown as even-numbered cores, and in % CPU they double the total percentage. So an 8-core Intel CPU then has a total of 16 cores, and can reach 1600% when running flat out with Hyper-threading.

coremanintel

cpuendstop

The other situation affects Apple silicon chips, as their CPU cores can be run at a wide range of different frequencies according to control by macOS. However, Activity Monitor makes no allowance for their frequency. When it shows a core or total % CPU, that could be running at a frequency as low as 600 MHz in the M1, or as high as 4,512 MHz in the M4, nine times as fast. Totalling these percentages also makes no allowance for the different processing capacity of P and E cores.

Thus an M4 chip’s CPU cores could show a total of 400% CPU when all four E cores are running at 1,020 MHz with 100% active residency, or when four of its P cores are running at 4,512 MHz with 100% active residency. Yet the P cores would have an effective throughput of as much as six times that of the E cores. Interpreting % CPU isn’t straightforward, as nowhere does Activity Monitor provide core frequency data.

tuneperf1

In this example from an M4 Pro, the left of each trace shows the final few seconds of four test threads running on the E cores, which took 99 seconds to complete at a frequency of around 1,020 MHz, then in the right exactly the same four test threads completed in 23 seconds on P cores running at nearer 4,000 MHz. Note how lightly loaded the P cores appear, although they’re executing the same code at almost four times the speed.

Threads and more

For most work, you should display all the relevant columns in the CPU view, including Threads and GPU.

tuneperf2

Threads are particularly important for processes to be able run on multiple cores simultaneously, as they’re fairly self-contained packages of executable code that macOS can allocate to a core to run. Processes that consist of just a single thread may get shuffled around between different cores, but can’t run on more than one of them at a time.

Another limitation of Activity Monitor is that it can’t tell you which threads are running on each core, or even which type of core they’re running on. When there are no other substantial threads active, you can usually guess which threads are running where by looking in the CPU History window, but when there are many active threads on both E and P cores, you can’t tell which process owns them.

You can increase your chances of being able to identify which cores are running which threads by minimising all other activity, quitting all other open apps, and waiting until routine background activities such as Spotlight have gone quiet. But that in itself can limit what you can assess using Activity Monitor. Once again, Xcode Instruments and powermetrics can make such distinctions, but can easily swamp you with details.

GPU and co-processors

Percentages given for the GPU appear to be based on a similar concept of active residency without making any allowance for frequency. As less is known about control of GPU cores, that too must be treated carefully.

Currently, Activity Monitor provides no information on the neural processing unit (ANE), or the matrix co-processor (AMX) believed to be present in Apple silicon chips, and that available using other tools is also severely limited. For example, powermetrics can provide power use for the ANE, but doesn’t even recognise the existence of the AMX.

Setting up

In almost all cases, you should set the CPU view to display All processes using the View menu. Occasionally, it’s useful to view processes hierarchically, another setting available there.

Extras

Activity Monitor provides some valuable extras in its toolbar. Select a process and the System diagnostics options tool (with an ellipsis … inside a circle) can sample that process every millisecond to provide a call graph and details of binary images. The same menu offers to run a spindump, for investigating processes that are struggling, and to run sysdiagnose and Spotlight diagnostic collections. The Inspect selected process tool ⓘ provides information on memory, open files and ports, and more.

Explainers

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

multiterms1

When an app is run, there’s a single runtime instance created from its single executable code. This is given its own virtual memory and access to system resources that it needs. This is a process, and listed as a separate entity in Activity Monitor.

Each process has a main thread, a single flow of code execution, and may create additional threads, perhaps to run in the background. Threads don’t get their own virtual memory, but share that allocated to the process. They do, though, have their own stack. On Apple silicon Macs they’re easy to tell apart as they can only run on a single core, although they may be moved between cores, sometimes rapidly. In macOS, threads are assigned a Quality of Service (QoS) that determines how and where macOS runs them. Normally the process’s main thread is assigned a high QoS, and background threads that it creates may be given the same or lower QoS, as determined by the developer.

Within each thread are individual tasks, each a quantity of work to be performed. These can be brief sections of code and are more interdependent than threads. They’re often divided into synchronous and asynchronous tasks, depending on whether they need to be run as part of a strict sequence. Because they’re all running within a common thread, they don’t have individual QoS although they can have priorities.

Tune for Performance: do more threads run faster?

By: hoakley
10 December 2024 at 15:30

One of the most distinctive features about modern Macs is that they have multiple cores, in Apple silicon models a minimum of eight. For an app to be able to make use of more than one core (or its equivalent) at a time it needs to divide its processing into threads, discrete blocks of code that can be run on different cores by macOS. If it doesn’t do that, then running that app on a high-end Pro, Max or Ultra chip is unlikely to be significantly faster than on a base model (assuming that task is CPU-bound).

Activity Monitor

You might be able to get a good idea as to how well an app makes use of multiple cores from watching it in use in Activity Monitor’s CPU History window, but in many cases that isn’t conclusive, and needs to be confirmed.

To illustrate ways to tackle this, I take the example of file compression. Some methods lend themselves to multiple threads better than others, and some may be implemented in a way that won’t accelerate when several cores are available. For the occasional user this might make little difference, but if you were to spend much of your day waiting for 10-100 GB files to compress, it merits a little exploration.

polycore1

This CPU History window from a compression task run on an M4 Pro is typically unhelpful. A single file compression is seen in the group of 4-5 peaks in CPU in the right half of each trace. Although compression used all ten P cores, it appears to have been moved between P clusters, seen by comparing the timing of peaks on Cores 8 and 10, 9 and 11. At no time does the task appear to exceed 50% active residency on any of the cores, though. All you can do is guess as to what might be going on, and whether it might run faster on twelve or more P cores.

polycore2

On another run, the same task is reported to have reached a peak of 500% CPU in 12 threads, but would it exceed that on more P cores, given that 10 were available?

Timing performance

As usual, I’ve been cheating a little to generate those results, by using my simple compression-decompression app Cormorant, in which I control how many threads it uses. If the app you’re trying to investigate offers a similar feature, then you can set up a standard task, here the compression of a 15.517 GB IPSW file, and time how long it takes using different numbers of threads.

polycore3

The answers from Cormorant, which conveniently performs its own timing, are:

  • 1 thread takes 49.32 seconds
  • 2 take 26.74 s
  • 3 take 18.60 s
  • 4 take 14.29 s
  • 5 take 11.21 s.

So Cormorant’s compression can make good use of more P cores, although with diminishing returns.

Very few apps give you this level of control, though. The only other compression utility that does appear to is Keka.

polycore4

In its settings, you can give its tasks a maximum number of threads, and even run them at custom Quality of Service (QoS) if you want them to be run in the background on E cores, and not interrupt your work on P cores.

Controlling threads

There is one way that you can limit the effective number of threads used by an arbitrary app, and that’s to run it in a Virtual Machine, as you control the number of virtual cores that it uses. While you can just about run a macOS VM on a single core alone, I suggest that a more workable starting point is two cores, and you can increase that number up to the total of P cores in the host without causing problems.

VMs have other virtues, including their relative lack of background processes, allowing their virtual cores to be almost entirely devoted to your test task. However, as they can’t run apps from the App Store other than Apple’s free suite of Pages, Numbers and Keynote, that could prevent you from using them for testing.

To set up a VM for thread tests, I duplicated a standard Sonoma 14.7.1 VM so I could throw it away at the end, opened it with five virtual cores, and copied over the test app Cormorant and file. I then closed that VM down, set it to use 2 virtual cores, opened it and ran my test. I repeated that with an increasing number of virtual cores up to a total of 5. Compression times are:

  • 2 vCPUs take 30.13 seconds
  • 3 take 21.36 s
  • 4 take 17.18 s
  • 5 take 13.59 s.

Those are only slightly slower than their equivalents from the host tests above.

Analysis

polycore5

Plotting those results out using DataGraph, the lines of best fit follow power laws:

  • for real cores, time = 49.8863/(T^0.91169)
  • for virtual cores, time = 54.737/(T^0.854318)

where T is the number of threads. That explains the apparently diminishing returns with increasing numbers of threads, although the maths isn’t as simple as we might like.

A better way to look at this is by calculating the rate of compression in GB/s, simply by dividing the file size of 15.517 GB by each time. Here we end up with straight lines from linear regression, that are more amenable to thought.

polycore6

Those regressions are:

  • for real cores, rate of compression = 0.0464 + (0.264 x T)
  • for virtual cores, rate of compression = 0.102 + (0.206 x T)

which are more generally useful.

Tuning performance

This might appear over-elaborate and of little practical use, but we now have a much better understanding of the factors limiting compression performance:

  • The more threads compression uses, the shorter time a task will take.
  • There are limits to that improvement, though, when substantially more than 5 threads are used.
  • Compression rates achieved in M4 P cores are significantly lower than read or write speeds of the internal SSD, so aren’t likely to be limited by the speed of a faster SSD (TB3 or USB4).
  • As compression appears to be CPU-bound, faster P cores would also be expected to result in shorter times.
  • Improving the efficiency of the compression code could increase performance.
  • Compression in a VM runs at about 78% of speed on the host.

To see how reliable these are, I therefore repeated the Cormorant test using all 10 P cores on the host Mac, which took 6.80 seconds, a little more than half the time for 5 cores. That’s a compression rate of 2.28 GB/s, rather less than the 2.69 GB/s predicted by the linear regression. That’s now approaching the write speed of some TB3 SSDs, and the rate-limiting step could then change to be the write performance of the storage being used, rather than CPU cores.

Last Week on My Mac: Tuning for performance

By: hoakley
8 December 2024 at 16:00

Perhaps the greatest subversion of EVs is their threat to car culture. Over more than a century, modifying and tinkering with internal combustion engines has become a popular obsession, and grown a huge industry devoted to tuning for performance. Replacing those lovingly crafted vehicles with battery-powered appliances seems too much to bear for all those enthusiasts.

Half the Mac articles posted here last week are about improving performance, whether in the CPU cores of the M4 or when connecting to external displays and storage. In comments we compared predictions, benchmarks and experience in our quest for improvement. Most telling, though, are those who report little improvement in the software they use to earn their livelihood, products that have been central to Macs for decades, such as Adobe Photoshop since 19 February 1990. If all that engineering effort has had so little effect, there’s something seriously amiss.

There’s much to be transferred from our experience of tuning vehicles to improving the performance of our apps. First is the appreciation that benchtests don’t necessarily translate into what happens on the road. Second is the need for objective measurements to assess performance relevant to our aims. Third is use of a systematic approach to improvement, in recognising where bottlenecks or constraints are, and addressing them methodically.

Benchtests

When each new Apple silicon chip becomes accessible, there’s a race to post its first Geekbench results and thereby demonstrate how performant it is. YouTube is now filling up with demonstrations of impressive or disappointing figures shown on the dials of Blackmagic speed tests on Thunderbolt 5 devices. It’s significant here that the Blackmagic test displays analogue meters taken from those on traditional car dashboards.

Few ever drill down and ask what those numbers returned by Geekbench mean, nor the relevance of disk speed tests to situations where transfer to or from storage limits app performance. Although Primate Labs provide details of the compendium of tests used to calculate Geekbench scores, it’s impossible to know how those might compare with code run by the apps we use. A Mac with impressive benchmark scores can still be dog-slow when running our daily tasks.

Blackmagic Disk Speed Test appears to report write and read speeds measured for a single file size of 5 GB, but tells you little about how storage might cope with large numbers of smaller files, or those much larger. As results vary between tests, it needs some method of providing the best estimate for a series of results, and a measure of the confidence in that number. As a quick, fun method of checking whether storage is up to the task of recording and playing back video files, it’s ideal, but it’s neither intended nor suitable for more general purposes.

Objective measurements

Subjective assessments are widely known for their power to mislead. If you ever want the thrill of your life, ride a recumbent trike at anything over 30 miles an hour down a steep and winding hill. Because your bum and eyes are so much closer to the road surface it feels like three times that speed in a car. You can see similar effects in non-linear progress bars. When they’re slow to start with and accelerate rapidly from midway on, they appear quicker than the reverse, with an apparently interminable wait for the last ten percent to complete.

My heart sinks when someone tells me that something is slow without being specific about what that something is, and providing numbers to support that impression. The only objective assessments are quantitative, in numbers rather than feelings.

But those numbers must also measure something meaningful. In the past we’ve tried simply timing how long a Mac takes to start up, or to launch an app, as if that’s what we spend all day waiting for. When you only launch an app once a day and restart your Mac every couple of weeks, who cares whether those take a few seconds more or less?

Ideally, you need to identify tasks that you have to wait for repeatedly, with discrete instants at the start and end that are separated by several seconds at least. Those can then be timed fairly accurately and reproducibly to form your own task-specific performance benchmark. Then when you come to the stage of testing out the effects of different settings and hardware, you can compare those times. This could become a bit more technical; when assessing the performance of the macOS Unified log, for instance, I’ve calculated the time in nanoseconds between log entries, but that level of precision is rarely needed.

Identification

Armed with comparisons of the time to perform relevant key tasks, we must then analyse what’s involved in each and determine which step is rate-limiting. Most tasks involve several stages, perhaps starting with reading of data from storage into memory, then processing that and displaying an outcome. Those in turn depend on effective read speed, memory capacity and access time, and a host of operations in CPU cores, GPU and possibly other units in the chip.

The key to understanding performance is Activity Monitor, in its different views. Developers also use Xcode’s valuable collection of Instruments, there are also Signposts widely used in the log, and if you really want to hone in on what’s happening in processor cores there’s always powermetrics. But well before you risk getting lost in those weeds, observations in Activity Monitor should provide a good picture of what’s going on, and where delays are occurring.

Activity Monitor does have its pitfalls, as with any other method of investigation. Perhaps its greatest shortcoming on Apple silicon Macs is in not displaying or taking into account frequencies of the two types of CPU core, but at least its CPU History window does show which type of core is bearing the brunt.

Goal

Behind all this is the need for a more critical approach when tuning our Macs for better performance. Fitting a new exhaust system to a car might make it sound good, but unless you go to the trouble of tuning it for performance, it might actually be slower on the road, all bark and no bite. Unlike Apple’s devices, our Macs haven’t become appliances, and in coming articles I’ll explore how you can tune yours.

Inside M4 chips: P cores hosting a VM

By: hoakley
13 November 2024 at 15:30

One common but atypical situation for any M-series chip is running a macOS virtual machine. This article explores how virtual CPU cores are handled on physical cores of an M4 Pro host, and provides further insight into their management, and thread mobility across P core clusters.

Unless otherwise stated, all results here are obtained from a macOS 15.1 Sequoia VM in my free virtualiser Viable, with that VM allocated 5 virtual cores and 16 GB of memory, on a Mac mini M4 Pro running macOS Sequoia 15.1 with 48 GB of memory, 10 P and 4 E cores.

How virtual cores are allocated

All virtualised threads are treated by the host as if they are running at high Quality of Service (QoS), so are preferentially allocated to P cores, even though their original thread may be running at the lowest QoS. This has the side-effect of running virtual background processes considerably quicker than real background threads on the host.

In this case, the VM was given 5 virtual cores so they could all run in a single P cluster on the host. That doesn’t assign 5 physical cores to the VM, but runs VM threads up to a total of 500% active residency across all the P cores in the host. If the VM is assigned more virtual cores than are available in the host’s P cores, then some of its threads will spill over and be run on host E cores, but at the high frequency typical of host threads with high QoS.

Performance

There is a slight performance hit in virtualisation, but it’s surprisingly small compared to other virtualisers. Geekbench 6.3.0 benchmarks for guest and host were:

  • CPU single core VM 3,643, host 3,892
  • CPU multi-core VM 12,454, host 22,706
  • GPU Metal VM 102,282, host 110,960, with the VM as an Apple Paravirtual device.

Some tests are even closer: using my in-core floating point test, 1,000 Mloops run in the VM in 4.7 seconds, and in the host in 4.68 seconds.

Host core allocation

To assess P core allocation on the host, an in-core floating point test was run in the VM. This consisted of 5 threads with sufficient loops to fully occupy the virtual cores for about 20 seconds. In the following charts, I show results from just the first 15 seconds, as representative of the whole.

vmpcoresm4pro0

When viewed by cluster, those threads were mainly loaded first onto the first P cluster (pale blue bars), where they ran for just over 1 second before being moved to the second cluster (red bars). They were then regularly switched between the two P clusters every few seconds throughout the test. Four cycles were completed in this section of the results, with each taking 2.825 seconds, so threads were switched between clusters every 1.4 seconds, the same time as I found when running threads on the host alone, as reported previously.

For most of the 15 seconds shown here, total active residency across both P clusters was pegged at 500%, as allocated to the VM in its 5 virtual cores, with small bursts exceeding that. Thus that 500% represents those virtual cores, and the small bursts are threads from the host. Although the great majority of that 500% was run on the active P cluster, a total of about 30% active residency consisted of other threads from the VM, and ran on the less active P cluster. That probably represents the VM’s macOS background processes and overhead from its folder sharing, networking, and other Virtio device use.

vmpcoresm4pro1

When broken down to individual cores within each cluster, seen in the first above and the second below, total activity differs little across the cores in the active cluster. During its period in the active cluster, each core had an active residency of 80-100%, bringing the cluster total to about 450% while most active.

vmpcoresm4pro2

In case you’re wondering whether this occurs on older Apple silicon, and it’s just a feature of macOS Sequoia, here’s a similar example of a 4-core VM running 3 floating point threads in an M1 Max with macOS 15.1, seen in Activity Monitor’s CPU History window. There’s no movement of threads between clusters.

vmm1maxtest1

P core frequencies

powermetrics, used to obtain this data, provides two types of core frequency information. For each cluster it gives a hardware active frequency, then for each core it gives an individual frequency, which often differs within each cluster. Cores in the active P cluster were typically reported as running at a frequency of 4512 MHz, although the cluster frequency was lower, at about 3858 MHz. For simplicity, cluster frequencies are used here.

vmpcoresm4pro3

This chart shows reported frequencies for the two P clusters in the upper lines. Below them are total cluster active residencies to show which cluster was active during each period.

The active cluster had a steady frequency of just below 3,900 MHz, but when it became the less active one, its frequency varied greatly, from idle at 1,260 MHz up to almost 4,400 MHz, often for very brief periods. This is consistent with the active cluster running the intensive in-core test threads, and the other cluster handling other threads from both the VM and host.

CPU power

Several who have run VMs on notebooks report that they appear to drain the battery quickly. Using the previous results from the host, the floating point test used here would be expected to use a steady 7,000 mW.

vmpcoresm4pro4

This last chart shows the total CPU power use in mW over the same period, again with cluster active residency (here multiplied by 10), added to aid recognition of cluster cycles. This appears to average about 7,500 mW, only 500 mW more than expected when run on the host alone. That shouldn’t result in a noticeable increase in power usage in a notebook.

In the previous article, I remarked on how power used appeared to differ between the two clusters, and this is also reflected in these results. When the second cluster (P1) is active, power use is less, at about 7,100 mW, and it’s higher at about 7,700 when the first cluster (P0) is active. This needs to be confirmed on other M4 Pro chips before it can be interpreted.

Key information

  • macOS guests perform almost as well as the M4 Pro host, although multi-core benchmarks are proportionate to the number of virtual cores allocated to them. In particular, Metal GPU performance is excellent.
  • All threads in a VM are run as if at high QoS, thus preferentially on host P cores. This accelerates low QoS background threads running in the VM.
  • Virtual core allocation includes all VM overhead from the VM, such as its macOS background threads.
  • Guest threads are as mobile as those of the host, and are moved between P clusters every 1.4 seconds.
  • Although threads run in a VM incur a small penalty in additional power use, this shouldn’t be significant for most purposes.
  • Once again, evidence suggests that the first P cluster (P0) in an M4 Pro uses slightly more power than the second (P1). This needs to be confirmed in other systems.
  • powermetrics can’t be used in a VM, not unsurprisingly.

Previous article

Inside M4 chips: P cores

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Inside M4 chips: P cores

By: hoakley
11 November 2024 at 15:30

This is the first in a series diving deeper into Apple’s new M4 family of chips. This starts with details of its Performance (P) cores. Comparisons of their performance against cores in earlier M-series chips will follow separately when I have completed them.

M4 family

There are currently three M4 designs:

  • Base M4, with 4 P and 6 E cores, also available in a cheaper variant with only 4 active E cores, and a ‘binned’ variant for iPads with only 3 active P cores.
  • M4 Pro, with 10 P and 4 E cores, also available in a ‘binned’ variant with only 8 active P cores.
  • M4 Max, with 12 P and 4 E cores, also available in a ‘binned’ variant with only 10 active P cores.

Apple is expected to release an Ultra variant in 2025, consisting of two M4 Max chips connected and working in tandem, providing a total of 24 P and 8 E cores.

Apart from the number of cores in each design, their caches and memory, all P cores are the same, and different from E cores.

P core architecture

All CPU cores are arranged in clusters of up to 6. All cores within any given cluster share L2 cache, and are run at the same frequency (clock speed). The Base M4 has a single cluster of 4 P cores, while the Pro and Max have two clusters of 5 and 6 cores respectively.

Frequency

A prominent feature of both P and E cores is their variable frequency (clock speed). In the case of P cores, this can be set to any of 17 values between the minimum of 1,260 MHz and maximum of 4,512 MHz (1.3-4.5 GHz). When running macOS, cluster frequencies are set by macOS at a kernel level; other operating systems may offer more direct control.

P cores idle at 1,260 MHz, but can also be shut down altogether. Previous M-series chips have been reported by the powermetrics command tool as sometimes being idle at a frequency of 0 MHz, but the M4 is the first to have idle and down states reported separately, for example:
CPU 4 active residency: 0.00%
CPU 4 idle residency: 0.00%
CPU 4 down residency: 100.00%

when that core and its whole cluster are shut down rather than just idling. It’s not clear whether this is merely an administrative change, or M4 cores implement this state differently from previous cores.

Instruction set

There’s confusion over the Instruction Set Architecture (ISA) supported by M4 cores. This is explained in the LLVM source, where it’s claimed that they’re “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE). Some might consider that’s closer to ARMv8.7-A, one version more recent than the M3’s ARMv8.6-A.

Although this is now fully supported in LLVM clang, it’s not clear how fully it’s supported by Xcode, for example.

Power

When shut down, a P core consumes no power, of course, and at idle with no active residency, it uses only 1-2 mW, according to measurements reported by powermetrics.

Maximum power consumption rises to approximately 1,400 mW when running intensive floating point calculations at 100% active residency, and to approximately 3,230 mW when running NEON vector instructions at 100% active residency.

macOS core allocation

Threads are normally allocated by macOS to an available P core when their designated Quality of Service (QoS) is higher than 9 (Background), for example when using Dispatch, formerly branded Grand Central Dispatch (GCD). Running threads may also be moved periodically between P cores in the same cluster, and between clusters. Previous M-series chips appear to move threads less frequently, and may leave them to run to completion after several seconds on the same core, but threads appear to be considerably more mobile when running on M4 P cores.

VM4core4threadPcoresARes

This bar chart shows 4 threads from 4 virtual CPUs in a VM running for 3 seconds at 100% active residency. For almost all that period, the threads remain running on the 4 physical cores of the first P cluster in this M1 Max, with the second P cluster remaining idle for much of that time.

The following charts show 4 threads of intensive in-core floating point arithmetic running on the P cores of an M4 Pro.

m4threads1clusters

When viewed by cluster, those threads are loaded first onto the second P cluster (red bars), where they run for 0.4 seconds before being moved to the first cluster (pale blue bars). After running there for 1.3 seconds, they’re moved back to the second cluster for a further 1.3 seconds, before completing on the first cluster.

The next two bar charts show each cluster separately, illustrating thread mobility within them.

m4threads2cluster1

When running on the first cluster (above), threads appear to be moved to a different core approximately every 0.3 second, as they do when on the second cluster (below).

m4threads3cluster2

m4threads4frequency

Cluster frequency matches this movement, with each cluster being run up to maximum frequency (shown here averaged across the whole cluster) to process the threads running on its cores. The black line below those for the P clusters shows the small changes in average frequency for the E cluster over this period.

m4threads5power

This last chart shows the total CPU power use in mW over the same period. Of particular interest here is the consistent difference in power use reported by powermetrics between the two P clusters: the first (P0) used a steady 6,000 mW when running these four threads, whereas the second (P1) used slightly less, at 5,700-5,800 mW. That could be the result of measurement error in powermetrics, peculiar to this particular chip, or could reflect an underlying difference between the two clusters.

Thread mobility makes interpreting CPU History in Activity Monitor difficult, as the fastest frequency of sampling available there is every second, while powermetrics was sampling every 0.1 second when gathering the data above. As groups of threads may be moved between clusters every 1.3 seconds or so, this can give the impression that threads are being run on both clusters simultaneously. Once again, great care is needed when interpreting the data shown by Activity Monitor.

Key information

  • Current M4 chips offer 4-12 CPU P cores.
  • M4 P cores are arranged in clusters of up to 6, sharing L2 cache and running at a common frequency.
  • P core clusters can be shut down, idling at their minimum frequency of 1,260 MHz, or at one of 18 set frequencies up to a maximum of 4,512 MHz, as controlled by macOS.
  • Their instruction set is “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE).
  • They use 1-2 mW when idle, rising to peaks of 1,400 mW (floating point) or 3,230 mW (NEON vector code).
  • macOS preferentially allocates them threads at all QoS higher than 9 (Background).
  • Threads running on M4 P cores are mobile, and may be moved to another core in the same cluster frequently, and after just over a second may be transferred to a core in the other P cluster, when available.
  • Thread mobility makes interpretation of the CPU History window in Activity Monitor very difficult.

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Why % CPU in Activity Monitor isn’t what you think

By: hoakley
8 November 2024 at 15:30

One of the most frequently quoted measurements in Activity Monitor is % CPU. When the fans blow, or a Mac becomes sluggish, it’s often the first figure to look at and wonder why that process is pulling over 100%. That’s the first thing you learn: here, per cent doesn’t mean out of 100, but the sum of the real percentage for each core. As this Mac has eight cores, that allows it up to 800%, until you realise that Intel processors can use Hyper-Threading to pretend that each is two cores, making a maximum of 1,600%, not bad for a single processor. Then you come completely unstuck with Apple silicon.

Apple defines % CPU as “the percentage of CPU capability that’s being used”, a phrase that doesn’t appear to have any definition either in Apple’s documentation or elsewhere. So like everyone else, you assume that 100% for a core represents its maximum processing capacity. Only you couldn’t be more incorrect: in truth, the number given as % CPU is nothing like that.

Tests

To examine this more clearly, I have an app that runs tight loops of assembly code to load CPU cores on Apple silicon chips and measure their performance. For this, I got it to run several threads, each consisting of 500 million loops of floating point calculations. By running those at different Quality of Service (QoS) settings, macOS will send them to different core types.

When I run two threads at low QoS so they’re run only on E cores of a MacBook Pro M3 Pro, each takes 10.2 seconds to complete. If I instead run eight threads at high QoS, the first six of those will be run on the P cores, but the remaining two spill over onto two of the E cores, where they complete in just 3.0 seconds each.

Yet, according to % CPU in Activity Monitor, the two threads on the E cores come to 200%, and the two from eight run in the second test also come to 200%. You can see this in the CPU History window.

m3activitymon

Here are the 6 E cores at the top, and 6 P cores below. ① marks the first test on two E cores, and shows the total of 200% spread across all six E cores over the 10.2 seconds it took to complete. ② marks the second test, with all 6 P cores and two E cores hitting 100% each, for a total of 800%, as shown in Activity Monitor’s window.

By now I’m sure you guessing how running the same code on the E cores could vary in speed by a factor of 3.4 (10.2 against 3.0 seconds) although they’re the same % CPU: that measurement doesn’t take into account core frequency.

Frequency

Unlike traditional Intel CPUs, CPU cores in Apple silicon chips can be run at a wide range of frequencies, as set by macOS. To make this frequency control a bit simpler, cores are grouped into clusters, that also share L2 cache, and within each cluster all cores run at the same frequency. The M3 Pro chip consists of two clusters, one of 6 E cores, the other of 6 P cores.

When macOS loads two of those E cores with low QoS threads, it sets them to run at a low frequency to make the most of their energy efficiency. When it loads two E cores with threads that were intended to be run on P cores, as they have a high QoS, it runs the E cluster up to high frequency, so they perform better.

Measuring frequency of CPU cores is straightforward using the powermetrics command tool. When those threads were being run on E cores at low QoS, frequency was held at their minimum of 744 MHz, but run at high QoS their frequency was set to maximum, at 2748 MHz, 3.7 times faster. If full load at maximum frequency is 100% of their capability, then at low QoS they were really running at 27%, not the 100% given by Activity Monitor, and that largely accounts for the difference in time that those threads took to complete their floating point calculations.

What is % CPU?

What Activity Monitor actually shows as % CPU or “percentage of CPU capability that’s being used” is what’s better known as active residency of each core, that’s the percentage of processor cycles that aren’t idle, but actively processing threads owned by a given process. But it doesn’t take into account the frequency or clock speed of the core at that time, nor the difference in core throughput between P and E cores.

The next time that you open Activity Monitor on an Apple silicon Mac, to look at % CPU figures, bear in mind that while those numbers aren’t completely meaningless, they don’t mean what you might think, and what’s shown as 100% could be anything between 27-100%.

❌
❌