
Take control of disks using APFS

By: hoakley
18 March 2025 at 15:30

If you want a quiet life, just format each external disk in APFS (with or without encryption), and cruise along with plenty of free space on it. For those who need to do different, either getting best performance from a hard disk, or coping with less free space on an SSD, here are some tips that might help.

File system basics

A file system like APFS provides a logical structure for each disk. At the top level it’s divided into one or more partitions that in APFS also serve as containers for its volumes. Partitions are of fixed size, although you can always repartition a disk, a process that macOS will try to perform without losing any of the data in its existing partitions. That isn’t always possible, though: if your 1 TB disk already contains 750 GB, then repartitioning it into two containers of 500 GB each will inevitably lose at least 250 GB of existing data.

All APFS volumes within any given container share the same disk space, and by default each can expand to fill that. However, volumes can also have size limits imposed on them when they’re created. Those can reserve a minimum size for that volume, or limit it to a maximum quota size.

How that logical structure is implemented in terms of physical disk space depends on the storage medium used.

Faster hard disks

Hard disks store data in circular tracks of magnetic material. Each file occupies multiple sectors within those tracks, each sector holding 512 or 4096 bytes. Tracks are longer (of greater circumference) towards the edge of the platter than near its centre, but the disk spins at a constant number of revolutions per minute (its angular velocity is constant), so sectors at the periphery of the disk pass under the heads in less time than those closer to the centre of the platter. The result is that read and write performance varies according to where files are stored on the disk: the further they are from the centre, the faster they are.


This graph shows how read and write speeds change in a typical compact external 2 TB hard disk as data is stored towards the centre of the disk. At the left, the outer third of the disk delivers in excess of 130 MB/s, while the inner third at the right delivers less than 100 MB/s.
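As a rough illustration of why that happens, here's a short Swift sketch of an idealised model, in which sequential throughput is proportional to track circumference at constant rotational speed and linear bit density. All of the figures in it are assumptions chosen to resemble the drive above, not measurements:

    import Foundation

    // Idealised model: at constant RPM and constant linear bit density,
    // sequential throughput scales with track circumference, i.e. with radius.
    let rpm = 5_400.0                    // typical compact external drive
    let bitsPerMillimetre = 41_000.0     // assumed linear bit density
    let outerRadiusMM = 45.0             // assumed outermost data track
    let innerRadiusMM = 22.0             // assumed innermost data track

    func throughputMBps(atRadius r: Double) -> Double {
        let bitsPerTrack = 2 * Double.pi * r * bitsPerMillimetre
        return bitsPerTrack * (rpm / 60) / 8 / 1_000_000    // bits/s to MB/s
    }

    for fraction in stride(from: 0.0, through: 1.0, by: 0.5) {
        // fraction 0.0 = outer edge, filled first; 1.0 = innermost data track
        let r = outerRadiusMM - fraction * (outerRadiusMM - innerRadiusMM)
        print(String(format: "%.0f%% of the way in: about %.0f MB/s",
                     fraction * 100, throughputMBps(atRadius: r)))
    }

The exact numbers don't matter; the point is the steady fall in throughput as data is stored closer to the centre.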

You can use this to your advantage. Although you don’t control exactly where file data is stored on a hard disk, you can influence that. Disks normally fill with data from the periphery inwards, so files written first to an otherwise empty disk will normally be written and read faster.

You can help that on a more permanent basis by dividing the disk into two or more partitions (APFS containers), as the first will normally be allocated space on the disk nearest the periphery, so read and write faster than later partitions added nearer the centre. Adding a second container or partition of 20% of the total capacity of the disk won’t cost you much space, but it will ensure that performance doesn’t drop off to little more than half that achieved in the most peripheral 10%.

Reserving free space on SSDs

An SSD's firmware knows nothing of the disk's logical structure, and for key features it manages the whole of the storage as a single unit. Wear levelling ensures that individual blocks of memory undergo similar numbers of erase-write cycles, so they age evenly. Consumer SSDs that use dynamic SLC write caches allocate those from across the whole of their storage, not confined to any partition. You can thus manage free space at the level of the whole disk to keep sufficient dynamic cache available.

One approach is to partition the SSD to reserve a whole container, with its fixed size, to support the needs of the dynamic cache. An alternative is to use volume reserve and quota sizes for the same purpose, within a single container. For example, in a 1 TB SSD with a 100 GB SLC write cache you could either:

  • with a single volume, set its quota to 900 GB, or
  • add an empty volume with its reserve size set to 100 GB.

Which of these you choose comes down to personal preference, although you can't set a quota for the Data volume in a boot volume group, so the most practical solution for a boot disk is to add an empty volume with a specified reserve size.

To do this when creating a new volume, click on the Size Options… button and set the quota or reserve.
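The same can be done from the command line with diskutil, which accepts -quota and -reserve options when adding a volume. Here's a minimal Swift sketch that shells out to it to add an empty volume reserving 100 GB; the container identifier disk3 and the volume name are placeholders to replace with your own:

    import Foundation

    // Add an empty APFS volume whose reserve keeps 100 GB of the container free,
    // for example for an SSD's dynamic SLC write cache. Use -quota instead of
    // -reserve to cap a volume's size. Check your container's identifier first
    // with `diskutil list`.
    let task = Process()
    task.executableURL = URL(fileURLWithPath: "/usr/sbin/diskutil")
    task.arguments = ["apfs", "addVolume", "disk3", "APFS", "CacheReserve",
                      "-reserve", "100g", "-nomount"]

    do {
        try task.run()
        task.waitUntilExit()
        print(task.terminationStatus == 0 ? "Volume added." : "diskutil reported an error.")
    } catch {
        print("Couldn't run diskutil: \(error)")
    }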

Summary

  • Partition hard disks so that you only use the fastest 80% or so of the disk.
  • To reserve space in an SSD for dynamic caching, you can add a second APFS container.
  • A simpler and more flexible way to reserve space on SSDs is setting a quota size for a single volume, or adding an empty volume with a reserve size.
  • Size options can currently only be set when creating a volume.

Why SSDs slow down, and how to avoid it

By: hoakley
17 March 2025 at 15:30

Fast SSDs aren’t always fast when writing to them. Even an Apple silicon Mac’s internal SSD can slow alarmingly in the wrong circumstances, as some have recently been keen to demonstrate. This article explains why an expensive SSD normally capable of better than 2.5 GB/s write speed might disappoint, and what you can do to avoid that.

In normal use, there are three potential causes of reduced write speed in an otherwise healthy SSD:

  • thermal throttling,
  • SLC write cache depletion,
  • the need for Trimming and/or housekeeping.

Each of those should only affect write speed, leaving read speed unaffected.

Thermal throttling

Writing data to an SSD generates heat, and writing a lot can cause it to heat up significantly. Internal temperature is monitored by the firmware in the SSD, and when that rises sufficiently, writing to it will be throttled back to a lower speed to stabilise temperature. Some SSDs have proved particularly prone to thermal throttling, among them older versions of the Samsung X5, one of the first full-speed Thunderbolt 3 SSDs.

In testing, thermal throttling can be hard to distinguish from SLC write cache depletion, although thorough tests should reveal its dependence on temperature rather than the mere quantity of data written.

The only solution to thermal throttling is adequate cooling of the SSD. Internal SSDs in Macs with active cooling using fans shouldn’t heat up sufficiently to throttle, provided their air ducts are kept free and they’re used in normal ambient temperatures. Well-designed external enclosures should ensure sufficient cooling using deep fins, although active cooling using small fans remains more controversial.

SLC write cache

To achieve their high storage density, almost all consumer-grade SSDs store multiple bits in each of their memory cells, and most recent products store three in Triple-Level Cell or TLC. Writing all three bits to a single cell takes longer than it would to write them to separate cells, so most TLC SSDs compensate by using caches. Almost all feature a smaller static cache of up to 16 GB, used when writing small amounts of data, and a more substantial dynamic cache borrowed from main storage cells by writing single bits to them as if they were SLC (single-level cell) rather than TLC.

This SLC write cache becomes important when writing large amounts of data to the SSD, as the size of the SLC write cache then determines overall performance. In practice, its size ranges from around 2.5% of total SSD capacity to over 10%. This can’t be measured directly, but can be inferred from measuring speed when writing more data than can be contained in the cache. As it can’t be emptied during full-speed write, once the dynamic cache is full, write speed suddenly falls; for example a Thunderbolt 5 SSD with full-speed write of 5.5 GB/s might fall to 1.4 GB/s when its SLC write cache is full. This is seen in both external and internal SSDs.

To understand the importance of SLC write cache in determining performance, take this real-world example:

  • 100 GB is written to a Thunderbolt 5 SSD with an SLC write cache of 50 GB. Although the first half of the 100 GB is written at 5.5 GB/s, the remaining 50 GB is written at 1.4 GB/s because the cache is full. Total time for the whole write is then 44.8 seconds.
  • Performing the same to a USB4 SSD with an SLC write cache in excess of 100 GB has a slower maximum rate of 3.7 GB/s, but that’s sustained for the whole 100 GB, which then takes only 27 seconds, 60% of the time of the ‘faster’ SSD.
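A minimal Swift sketch of that arithmetic follows, so you can plug in the cache size and speeds of other SSDs; the figures below are those assumed in the example, not measurements of any particular product:

    import Foundation

    // Estimate the time to write a given amount of data, allowing for the drop
    // in speed once the SLC write cache is full.
    func writeTime(totalGB: Double, cacheGB: Double,
                   cachedGBps: Double, uncachedGBps: Double) -> Double {
        let cached = min(totalGB, cacheGB)          // written at full speed
        let uncached = max(0, totalGB - cacheGB)    // written once the cache is full
        return cached / cachedGBps + uncached / uncachedGBps
    }

    let tb5  = writeTime(totalGB: 100, cacheGB: 50,  cachedGBps: 5.5, uncachedGBps: 1.4)
    let usb4 = writeTime(totalGB: 100, cacheGB: 100, cachedGBps: 3.7, uncachedGBps: 1.0)
    print(String(format: "TB5 SSD: %.1f s, USB4 SSD: %.1f s", tb5, usb4))
    // TB5 SSD: 44.8 s, USB4 SSD: 27.0 s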

To predict the effect of SLC write cache size on write performance, you therefore need to know cache size, out-of-cache write speed, and the time required to empty a full cache between writes. I have looked at these on two different SSDs: a recent 2 TB model with a Thunderbolt 5 interface, and a self-assembled USB4 OWC 1M2 enclosure containing a Samsung 990 Pro 2 TB SSD. Other enclosures and SSDs will differ, of course.

The TB5 SSD has a 50 GB SLC write cache, as declared by the vendor and confirmed by testing. With that cache available, write speed is 5.5 GB/s over a TB5 interface, but falls to 1.4 GB/s once the cache is full. It then takes 4 minutes for the cache to be emptied and made available for re-use, allowing write speeds to reach 5.5 GB/s again.

The USB4 SSD has an SLC write cache in excess of 212 GB, as demonstrated by writing a total of 212 GB at its full interface speed of 3.7 GB/s. As the underlying performance of that SSD is claimed to exceed that required to support TB5, putting that SSD in a TB5 enclosure should enable it to comfortably outperform the other SSD.

Two further factors could affect SLC write cache: partitioning and free space.

When you partition a hard disk, that affects the physical layout of data on the disk, a feature sometimes used to ensure that data only uses the outer tracks where reads and writes are fastest. That doesn't work for SSDs, where the firmware manages storage use, and won't normally segregate partitions physically. This means that partitioning into APFS containers doesn't affect the SLC write cache, either in terms of size or performance.

Free space can be extremely important, though. SLC write cache can only use storage that’s not already in use, and if necessary has been erased ready to be re-used. If the SSD only has 100 GB free, then that can’t all be used for cache, so limiting the size that’s available. This is another good reason for the performance of SSDs to suffer when they have little free space available.

Ultimately, to attain high write speeds through SLC write cache, you have to understand the limits of that cache and to work within them. One potential method for effectively doubling the size of that cache might be to use two SSDs in RAID-0, although that opens further questions.

Trim and housekeeping

In principle, Trim appears simple. For example, Wikipedia states: “The TRIM command enables an operating system to notify the SSD of pages which no longer contain valid data. For a file deletion operation, the operating system will mark the file’s sectors as free for new data, then send a TRIM command to the SSD.” A similar explanation is given by vendors like Seagate: “SSD TRIM is a command that optimizes SSDs by informing them which data blocks are no longer in use and can be wiped. When files are deleted, the operating system sends a TRIM command, marking these blocks as free for reuse.”

This rapidly becomes more complicated, though. For a start, the TRIM command for SATA doesn’t exist for NVMe, used by faster SSDs, where its closest substitute is DEALLOCATE. Neither is normally reported in the macOS log, although APFS does report its initial Trim when mounting an SSD. That’s reported for each container, not volume.

What we do know from often bitter experience is that some SSDs progressively slow down with use, a phenomenon most commonly (perhaps only?) seen with SATA drives connected over USB. Those also don’t get an initial Trim by APFS when they’re mounted.

It’s almost impossible to assess whether time required for Trim and housekeeping is likely to have any adverse effect on SSD write speed, provided that sufficient free disk space is maintained to support full-speed writing to the SLC write cache. Neither does there appear to be any need for a container to be remounted to trigger any Trim or housekeeping required to erase deleted storage ready for re-use, provided that macOS considers that SSD supports Trimming.

Getting best write performance from an SSD

  • Avoid thermal throttling by keeping the SSD’s temperature controlled. For internal SSDs that needs active cooling by fans; for external SSDs that needs good enclosure design with cooling fins or possibly a fan.
  • Keep ample free space on the SSD so the whole of its SLC write cache can be used.
  • Limit continuous writes to within the SSD’s SLC write cache size, then allow sufficient time for the cache to empty before writing any more.
  • It may be faster to use an SSD with a larger SLC write cache over a slower interface, than one with a smaller cache over a faster interface.
  • Avoid SATA SSDs.

I’m grateful to Barry for raising these issues.

Speed or security? Speculative execution in Apple silicon

By: hoakley
25 February 2025 at 15:30

Making a CPU do more work requires more than increasing its frequency: it also needs the removal of obstacles that can prevent it from making best use of those cycles. Among the most important of those is memory access. High-speed local caches, L1 and L2, can be a great help, but in the worst case fetching data from memory can still take hundreds of CPU core cycles, and that memory latency may then delay a running process. This article explains some of the techniques used in the CPU cores of Apple silicon chips to improve processing speed, by making execution more efficient and less likely to be delayed.

Out-of-order execution

No matter how well a compiler and build system might try to optimise the instructions they assemble into executable code, when it comes to running that code there are ways to improve its efficiency. Modern CPU cores use a pipeline architecture for processing instructions, and can reorder them to maintain optimum instruction throughput. This uses a re-order buffer (ROB), which can be large to allow for greatest optimisation. All Apple silicon CPU cores, from the M1 onwards, use out-of-order execution with ROBs, and more recent families appear to have undergone further improvement.

In addition to executing instructions out of order, many modern processors perform speculative execution. For example, when code is running a loop to perform repeated series of operations, the core will speculate that it will keep running that loop, so rather than wait to work out whether it should loop again, it presses on. If it then turns out that it had reached the end of the loop phase, the core rolls back to where it entered the loop and follows the correct branch.

Although this wastes a little time on the last run of each loop, if it’s had to loop a million times before that, accumulated time savings can be considerable. However, on its own speculative execution can be limited by data that has to be loaded from memory in each loop, so more recently CPU cores have tried to speculate on the data they require.

Load address prediction

One common pattern of data access within code loops is in their addresses in memory. This occurs when the loop is working through a stored array of data, where the address of each item is at a constant address increment. For this, the core watches the series of addresses being accessed, and once it detects that they follow a regular pattern, it performs Load Address Prediction (LAP) to guess the next address to be used.

The core then performs two functions simultaneously: it proceeds to execute the loop using the guessed address, while continuing to load the actual address. Once it can, it then compares the predicted and actual addresses. If it guessed correctly, it continues execution; if it guessed wrong, then it rolls back in the code, uses the actual address, and resumes execution with that instead.

As with speculative execution, this pays off when there are a million addresses in a strict pattern, but loses out when a pattern breaks.

Load value prediction

LAP only handles addresses in memory, whose contents may differ. In other cases, values fetched from memory can be identical. To take advantage of that, the core can watch the value being loaded each time the code passes through the loop. This might represent a constant being used in a calculation, for example.

When the core sees that the same value is being used each time, it performs Load Value Prediction (LVP) to guess the next value to be loaded. This works essentially the same as LAP, with comparison between the predicted and actual values used to determine whether to proceed or to roll back and use the correct value.
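To make those two patterns concrete, here's a small Swift loop of the kind these predictors suit. Which loads the core actually predicts is entirely up to the hardware, so this only illustrates the access patterns, a constant stride through an array for LAP and a repeatedly loaded constant for LVP; it isn't a way to invoke them:

    // samples[i] is loaded from addresses that advance at a constant stride,
    // the pattern Load Address Prediction looks for; scale[0] loads the same
    // value on every pass, the pattern Load Value Prediction looks for.
    let samples = [Double](repeating: 1.5, count: 1_000_000)
    let scale = [2.0]

    var total = 0.0
    for i in 0..<samples.count {
        total += samples[i] * scale[0]
    }
    print(total)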

This diagram summarises the three types of speculative execution now used in Apple silicon CPU cores, and identifies which families in the M-series use each.

Vulnerabilities

Speculative execution was first discovered to be vulnerable in 2017, and this was made public seven years ago, in early 2018, in a class of attack techniques known as Spectre. LAP and LVP were demonstrated and exploited in SLAP and FLOP in 2024-25.

Mechanisms for exploiting speculative designs are complex, and rely on a combination of training and misprediction to give an attacker access to the memory of other processes. The only robust protection is to disable speculation altogether, although various types of mitigation have also been developed for Spectre. The impact of disabling speculative execution, LAP or LVP greatly impairs performance in many situations, and isn’t generally considered commercially feasible.

Risks

The existence of vulnerabilities that can be exploited might appear worrying, particularly as their demonstrations use JavaScript running in crafted websites. But translating those into a significant risk is more challenging, and a task for Apple and those who develop browsers to run in macOS. It’s also a challenge to third-parties who develop security software, as detecting attempts to exploit vulnerabilities in speculative behaviour is relatively novel.

One reason we haven’t seen many (if any) attacks using the Spectre family of vulnerabilities is that they’re hardware specific. For an attacker to use them successfully on a worthwhile proportion of computers, they would need to detect the CPU and run code developed specifically for that. SLAP and FLOP are similar, in that neither would succeed on Intel or M1 Macs, and FLOP requires the LVP support of an M3 or M4. They’re also reliant on locating valuable secrets in memory. If you never open potentially malicious web pages when your browser already has exploitable pages loaded, then they’re unlikely to be able to take advantage of the opportunity.

Where these vulnerabilities are more likely to be exploited is in more sophisticated, targeted attacks that succeed most when undetected for long periods, those more typical of surveillance by affiliates of nation-states.

In the longer term, as more advanced CPU cores become commonplace, risks inherent in speculative execution can only grow, unless Apple and other core designers address these vulnerabilities effectively. What today is impressive leading-edge security research will help change tomorrow’s processor designs.

Further reading

Wikipedia on out-of-order execution
Wikipedia on speculative execution
SLAP and FLOP, with their original papers

macOS Sequoia 15.3 has improved Thunderbolt 5 performance

By: hoakley
4 February 2025 at 15:30

There have been many reports of problems with Thunderbolt 5 support in Apple’s latest MacBook Pro and Mac mini models with M4 Pro and Max chips. Among the more concerning have been poor performance when accessing SSDs through TB5 docks and hubs, and the inability to drive more than two 4K displays through those. This article looks at what has changed, and what can currently be achieved when accessing SSDs either directly or via TB5 docks or hubs.

When I last tested these, using a Mac mini M4 Pro and Sequoia 15.2, I found that speeds measured through a TB5 dock were generally at least as good as those through a TB4 hub, with three notable exceptions:

  • Write speed from a TB5 port to a TB3 SSD through a TB5 dock fell to 0.42 GB/s, little more than 10% of that of a direct connection and similar to that expected from a SATA SSD operating over USB 3.2 Gen 2.
  • Write speed from a TB5 port to a USB4 SSD through a TB5 dock fell to 2.3 GB/s, about 62% of that expected.
  • Write speeds to a TB3 SSD through a TB5 dock occur at about half the expected speed, just as those through a TB4 hub.

Methods

Three Macs were used for testing:

  • iMac Pro (Intel, T2 chip) with macOS 15.1.1, over a Thunderbolt 3 port without USB4 support.
  • MacBook Pro (M3 Pro) with macOS 15.2, over a Thunderbolt 4/USB4 port.
  • Mac mini (M4 Pro) with macOS 15.3, over a Thunderbolt 5 port.

The results for the first two are taken from my previous tests, and here used for comparison.

The dock used was the Kensington SD5000T5 Thunderbolt 5 Triple Docking Station, with a total of three downstream TB5 ports. I’m very grateful to Sven who has provided his results from an OWC TB5 hub to support those from the dock.

Other methods are the same as those described previously. The TB5 SSD tested is one of the three currently available or on pre-order from OWC, Sabrent and LaCie (no, I’m not going to tell you which, as I’m still in the process of reviewing it).

Single SSDs

Results obtained from measuring read and write speeds on a single SSD at a time are summarised in the table below. Those that are concerning are set in bold italics.

Performance of the TB5 SSD when connected direct or through the dock was highest of all, at around 150% of the speeds achieved by the next fastest, the USB4 SSD, and around 180-250% of those of the TB3 SSD, the slowest. Direct connection of USB4 SSDs to the TB5 port in macOS 15.3 resulted in even faster speeds than a TB4/USB4 connection using 15.2. Thus, a TB5 port with 15.3 delivers the best performance over all types of external SSD tested here.

Of the three exceptionally poor results seen previously:

  • Write speed from a TB5 port to a TB3 SSD through a TB5 dock improved greatly from 0.42 GB/s to 1.6 GB/s, the same as in other Macs.
  • Write speed from a TB5 port to a USB4 SSD through a TB5 dock improved from 2.3 GB/s to 3.8 GB/s, the same as when connected direct.
  • Write speeds to a TB3 SSD through a TB5 dock remained at 1.6 GB/s, about half the expected speed, just as those through a TB4 hub.

This anomalous behaviour when writing to a TB3 SSD through a TB5 dock was also found by Sven in his tests on the OWC TB5 hub, and seems common to most if not all TB4 and TB5 docks and hubs. I haven’t seen any explanation as to why it occurs so widely.

Paired SSDs

Encouraged by these substantial improvements with Sequoia 15.3, I measured simultaneous read and write speeds to a pair of USB4 SSDs connected to the Kensington TB5 dock. Stibium has a GUI so can’t perform this in perfect synchrony. However, it reads or writes a total of 160 files in 53 GB during each of these tests, and outlying measurements are discounted using the following robust statistical techniques:

  • a 20% trimmed mean, giving the 20th and 80th centiles;
  • Theil-Sen regression;
  • linear regression through all measured values, returning a rate and latency.
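For anyone curious about those estimators, here's a small Swift sketch of the first two, a 20% trimmed mean and the Theil-Sen slope. This is my own illustration of the techniques, not Stibium's code, and the sample figures are invented:

    import Foundation

    // 20% trimmed mean: discard the lowest and highest 20% of values,
    // then average what remains (roughly the 20th-80th centile band).
    func trimmedMean(_ values: [Double], trim: Double = 0.2) -> Double {
        let sorted = values.sorted()
        let cut = Int(Double(sorted.count) * trim)
        let kept = sorted[cut..<(sorted.count - cut)]
        return kept.reduce(0, +) / Double(kept.count)
    }

    // Theil-Sen estimator: the slope of the fit is the median of the slopes
    // between every pair of points, making it robust to outliers.
    func theilSenSlope(x: [Double], y: [Double]) -> Double {
        var slopes: [Double] = []
        for i in 0..<x.count {
            for j in (i + 1)..<x.count where x[j] != x[i] {
                slopes.append((y[j] - y[i]) / (x[j] - x[i]))
            }
        }
        let sorted = slopes.sorted()
        return sorted[sorted.count / 2]
    }

    // Invented example: file sizes (GB) against transfer times (s).
    let sizeGB: [Double] = [0.5, 1, 2, 4, 8]
    let seconds: [Double] = [0.17, 0.33, 0.64, 1.30, 2.60]
    let rates = zip(sizeGB, seconds).map { $0.0 / $0.1 }
    print(trimmedMean(rates), "GB/s trimmed mean")           // about 3.1 GB/s
    print(1 / theilSenSlope(x: sizeGB, y: seconds), "GB/s")  // about 3.1 GB/s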

Measured transfer rates in each of the two USB4 SSDs are given in the table below.

The first row of results gives the two write speeds measured simultaneously when both the SSDs were writing, similarly the second gives the two read speeds for simultaneous reading, and the bottom line shows speeds when one SSD was writing and the other reading at the same time.

When both SSDs were transferring their data in the same direction, individual speeds were about 3.1 GB/s, but when the directions of transfer were mixed, with one reading and the other writing, their speeds were similar to a single USB4 SSD. Total transfer speed was thus about 6.2 GB/s when in the same direction, but 7.2 GB/s when in opposite directions.

Multiple displays

Many of those buying into TB5 are doing so early because of its promised support for multiple displays. I haven’t yet seen sufficient evidence to decide whether this has improved with Sequoia 15.3. However, OWC has qualified full display support of its TB5 hub as requiring “native Thunderbolt 5 display or other displays that support USB-C connections and DisplayPort 2.1”. One likely reason for multiple displays not achieving the support expected, such as three 4K at 144 Hz, is that they don’t support DisplayPort 2.1.

Which macOS?

As the evidence here suggests, macOS 15.3 or later is required for full TB5 performance, and OWC now includes that in the specifications for its TB5 hub. It also states that TB3 support requires macOS 15, although USB4 should still be supported in macOS 14 Sonoma.

Recommendations

  • TB5 SSDs are faster than USB4, which are faster than TB3, in almost every combination. The only exception to this is a USB4 SSD connected direct to a TB3 port, which is likely to be limited to 1.0 GB/s in both directions.
  • When pricing allows, prefer purchasing a ready-made TB5 SSD. If it’s to be used with an Intel Mac, confirm that it supports TB3 there.
  • Self-assembly TB5 enclosures remain expensive at present, and a USB4 enclosure may then prove better value, provided that it won’t be used with an Intel Mac.
  • Avoid writing to a TB3 SSD connected to a dock or hub, as its speed is likely to be limited to 1.6 GB/s.
  • Ensure Macs with TB5 ports are updated to Sequoia 15.3 or later.
  • Ensure Macs to be used with TB5 docks or hubs are updated to Sequoia 15 or later, or they may not fully support TB3.

Friday magic: how to cheat on E cores and get free performance

By: hoakley
31 January 2025 at 15:30

I’d love to be able to bring you a Mac magic trick every Friday, but they aren’t so easy to discover. Today’s is mainly for those with Apple silicon Macs, and is all about gaming the way that macOS allocates threads to their cores. Here, I’ll show you how to more than double the performance of the E cores at no cost to the P cores.

To do this, I’m using my test app AsmAttic to run two different types of core-intensive threads, one performing floating point maths including a fused multiply-add operation, the other running the NEON vector processor flat out.

When I run a single NEON thread of a billion loops at low Quality of Service (QoS) so that it’s run on the E cores, it takes 2.61 seconds to complete, instead of the 0.60 seconds it takes on a P core. But how can I get that same thread, running on the E cluster, to complete in only 1.03 seconds, 40% of the time it should take, and closer to the performance of a P core?

The answer is to run 11 more threads of 3 billion loops of floating point at the same time. That might seem paradoxical, particularly when those additional threads perform the same with or without that NEON thread on the E cores, so come for free. Perhaps I’d better explain what’s going on here.

Normally, when you run these threads at low QoS, macOS runs them on the E cores, at low frequency for added efficiency. On the M4 Pro used for these tests, the NEON test that took 2.61 seconds was on E cores pottering along at a frequency of less than 1,100 MHz across the whole of the E cluster, not much faster than their idle frequency of 1,020 MHz.

One way to get macOS to increase the frequency of all the E cores in the cluster is to persuade it to run a thread with high QoS that won’t fit onto the P cores. In this M4 Pro, that means loading its CPU with 11 floating point threads, of which 10 will be run on the two clusters of 5 P cores each. That leaves the eleventh thread to go on the E cluster. macOS then kindly increases the frequency of the E cluster to around 2,592 MHz, giving my NEON thread a speed boost of around 235%, which accounts for the performance increase I observed.
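Quality of Service is what steers these threads onto the E or P clusters in the first place. Here's a minimal Swift sketch of the pattern used in this test, my own illustration of QoS-based dispatch rather than AsmAttic's code, with a placeholder workload standing in for the NEON and floating point loops:

    import Foundation

    // Placeholder for a core-intensive workload.
    func spin(loops: Int) {
        var x = 1.000000001
        for _ in 0..<loops { x = x * x.squareRoot() + 1e-12 }
        print(x)    // stop the loop being optimised away
    }

    let group = DispatchGroup()

    // One thread at low QoS: macOS runs this on the E cluster.
    DispatchQueue.global(qos: .background).async(group: group) {
        spin(loops: 1_000_000_000)
    }

    // Eleven threads at high QoS: on this M4 Pro, ten fill the P cores and the
    // eleventh spills onto the E cluster, whose frequency macOS then raises.
    for _ in 0..<11 {
        DispatchQueue.global(qos: .userInitiated).async(group: group) {
            spin(loops: 3_000_000_000)
        }
    }

    group.wait()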

These two tests are shown in the CPU History window from Activity Monitor. The single NEON thread run alone in the E cluster is marked by 1, when there was essentially no activity in the P cores. The figure 2 marks when the same NEON thread was run while all 10 P cores and one of the E cores were running the floating point maths. Yet the NEON thread at 2 completed in less than half the time of that at 1.

With just two substantial threads running on the E cluster, it still has enough processing power that the 11 floating point threads complete in the same time, regardless of whether the NEON thread is also running. That extra performance therefore comes free, with nothing else being slowed to compensate.

Of course in the real world, this sort of effect is likely to be extremely rare. But it might account for the occasional unexpectedly good performance of a background thread running at low QoS, and I can’t see any downsides either.

The other way you could get that low QoS thread to perform far better would be running it in a Virtual Machine, as that runs everything on P cores regardless of their QoS. Sadly, despite searching extensively, I still haven’t discovered any other way of convincing macOS to run low QoS threads any faster, except by magic.

M4 Pro full on: when CPU and GPU draw over 50 W, and how Low Power mode changes that

By: hoakley
22 January 2025 at 15:30

Most testing and benchmarks avoid putting heavy loads on the CPU and GPU at the same time, and so avoid running an Apple silicon chip ‘full on’. This article explores what happens in the CPU and GPU of an M4 Pro when they’re drawing a total of over 50 W, and how that changes in Low Power mode. It concludes my investigations of power modes, for the time being.

Methods

Three test runs were performed on a Mac mini M4 Pro with 10 P and 4 E cores, and a 20-core GPU. In each run, Blender Benchmarks were run using Metal, and shortly after the start of the first of those, monster, 3 billion tight loops of NEON code were run on CPU cores at maximum Quality of Service in 10 threads. From previous separate runs, the monster test runs the GPU at its maximum frequency of 1,578 MHz and 100% active residency, to use about 20 W, and that NEON code runs all 10 P cores at high frequency of about 3,852 MHz and 100% active residency to use about 32 W. This combined testing was performed in each of the three power modes: Low Power, Automatic, and High Power.

In addition to recording test performance, powermetrics was run during the start of each NEON test at its shortest sampling period, with both cpu_power and gpu_power samplers active.
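For reference, this is roughly how that powermetrics run can be scripted from Swift; powermetrics has to run as root, so this sketch assumes the program itself is launched with sudo, and the sampling period and count here are only examples:

    import Foundation

    // Sample CPU and GPU power with powermetrics, writing the results to a file.
    let task = Process()
    task.executableURL = URL(fileURLWithPath: "/usr/bin/powermetrics")
    task.arguments = ["--samplers", "cpu_power,gpu_power",
                      "-i", "100",                   // sampling period in ms
                      "-n", "20",                    // number of samples
                      "-o", "/tmp/powermetrics.txt"] // output file

    do {
        try task.run()
        task.waitUntilExit()
    } catch {
        print("Couldn't run powermetrics: \(error)")
    }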

Performance

There was no difference in performance between High Power and Automatic settings, which completed both tasks with the same performance as when they were run separately:

  • NEON time separate 2.12 s, together High Power 2.12 s, Auto 2.12 s
  • monster performance separate 1215-1220, together High Power 1221, Auto 1220.

As expected, Low Power performance was greatly reduced. NEON time was 4.33 s (49% performance), even slower than running alone at Low Power (2.87 s), and monster performance 795, slightly lower than running alone at Low Power (837).

High Power mode

This first graph shows CPU core cluster frequencies and active residencies for a period of 0.3 seconds when the monster test was already running, and the NEON test was started.

At time 0, the P0 cluster (black) was shut down, and the P1 cluster (red) running with one core at 100% active residency, a second at about 60%, and at about 3,900 MHz. As the ten test threads were loaded onto the two clusters, cluster frequencies were quickly brought to 3,852 MHz, by reducing that of the P1 cluster and rapidly increasing that of the P0 cluster.

By 0.1 seconds, both clusters were at full active residency and running at 3,852 MHz, where they remained until the NEON test threads completed.

Power used by the CPU followed the same pattern, rising rapidly from about 6,000 mW to about 32,000 mW at 0.1 seconds. GPU power varied between 8,600-23,000 mW, resulting in a peak total power of slightly less than 52,000 mW, and a dip to 40,600 mW. Typical sustained power with both CPU and GPU tests running was 50-52 W.

Low Power mode

These results are more complicated, and involve significant use of the E cluster.

This graph shows active residency alone, and this time includes the E cluster, shown in blue, and the GPU, in purple. NEON test threads were initially loaded into the two P clusters, filling them at 0.13 seconds. After that, threads were moved from some of those P cores to run on E cores instead, leaving just two test threads running on each of the P clusters by 0.26 seconds. Over much of that time the GPU had full active residency, but as that fell threads were moved from E cores back to P cores. By the end of this period of 0.5 seconds, 4 of 5 cores in each of the two P clusters were at 100%, and the GPU was also at 100% active residency.

This bar chart shows changing cluster total active residency for the E (red) and two P (blues) clusters by sample. With 10 test threads and significant overhead, the total should have reached at least 1,000%, which was only achieved in sample 4, and from sample 13 onwards.

Those active residencies are shown in the lower section of this graph (with open circles), together with cluster frequencies (filled circles) above them. As the P clusters were being loaded with test threads, both P clusters (black) were brought to a frequency of only 1,800 MHz, compared with 3,852 MHz in the High Power test. The E cluster (blue) was run throughout at its maximum frequency of 2,592 MHz, except for one sample period. GPU frequency (purple) remained below 1,000 MHz throughout, compared with a steady maximum of 1,578 MHz when at High Power.

Power changed throughout this initial period running the NEON test. Initially, CPU power (red) rose to a peak of 6,660 mW, then fell slowly to 3,500 mW before rising again to about 6,000 mW. GPU power rose to peak at just over 7,000 mW, but at one stage fell to only 26 mW. Total power used by the CPU and GPU ranged between 11-13.2 W, apart from a short period when it fell below 5 W. Those are all far lower than the steadier power use in High Power mode.

How macOS limits power

Running these tests in Low Power mode elicited some of the most sophisticated controls I have seen in Apple silicon chips. Compared to being run unfettered in Automatic or High Power mode, macOS used a combination of strategies to keep CPU and GPU total power use below 13.5 W:

  • P core frequencies were limited to 1,800 MHz, instead of 3,852 MHz.
  • High QoS threads that would normally have been run on P cores were transferred to E cores, which were then run at their maximum frequency of 2,592 MHz.
  • Threads continued to be transferred between E and P cores to balance performance against power use.
  • GPU frequency was limited to below 1,000 MHz.
  • Despite reducing power use to a total of 25% of High Power mode, effects on performance were far less, attaining about 50% of that at High Power mode.

References

How Low Power mode controls CPU cores
Power Modes and Apple silicon CPUs
Last Week on My Mac: Power throttle
Inside M4 chips: CPU power, energy and mystery
Inside M4 chips: Matrix processing and Power Modes
Power Modes and Apple Silicon GPUs
Evaluating M3 Pro CPU cores: 1 General performance

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

What can you do with virtualised macOS on Apple silicon?

By: hoakley
21 January 2025 at 15:30

If you want to run an older version of macOS that your Mac doesn’t support, so can’t boot into, then the only option is to run it within your current macOS. You may be able to do that using one of two methods: virtualisation or emulation. Emulation is normally used when the older macOS runs on a different processor, while virtualisation may be available when the processors share the same architecture.

Running Intel macOS

If you want to run a version of macOS for Intel processors, including Catalina and earlier, then your Apple silicon Mac has to run a software emulation of an Intel processor for that to work. Although this is possible using UTM, emulation is slow and not reliable enough to make this feasible for everyday use. It’s impressive but not practical at present. For an Apple silicon Mac, that automatically rules out running any macOS before Big Sur, the first version with support for Arm processors.

Rosetta 2 for Apple silicon Macs isn’t an emulator, although it allows you to run code built for Intel Macs on Apple silicon. It achieves that by translating the instructions in that code to those for the Arm cores in Apple silicon chips. This is highly effective and in most cases runs at near-native speed, but Rosetta 2 can’t be used to translate operating systems like macOS, so can’t help with running older macOS.

Virtualisation

Virtualisation is far more practical than emulation, as it doesn’t involve any translation, and most code in the virtualised operating system should run direct on your Mac’s CPU cores and GPU. What is required for virtualisation to work is driver support to handle access to devices such as storage, networks, keyboards and other input devices. Those enable apps running in macOS inside the virtual machine (VM), the guest, to use features of the host Mac.

Virtualisation of macOS, Windows and Linux has been relatively straightforward in the past on Intel Macs, as they’re essentially PCs, and providing driver support for guest operating systems has been feasible. That has changed fundamentally with Apple silicon chips, where every hardware device has its own driver, unique to Apple and undocumented. Without devoting huge resources to the project, it simply isn’t feasible for third parties to develop their own virtualisation of macOS on Apple silicon.

Recognising this problem, Apple has adopted a solution that makes it simple to virtualise supported macOS (and Linux) using a system of Virtio drivers. Those have been progressively written into macOS so that it works both as a guest and host, for services that are supported by a Virtio driver, and all versions of macOS since Monterey have been able to virtualise Monterey and later when running on Apple silicon.

The drawback is that, although features supported by Virtio drivers are readily implemented in virtualisers, those that aren’t can’t be supported unless Apple builds a new Virtio driver into macOS. Even then, that new feature will only be available in that and later versions of macOS on both host and guest, as support is needed in both before it can work.

Another important consequence of virtualisation being built into macOS is that different virtualising apps all rely on the same features, and act as wrappers for macOS. While different apps may offer different sets of features and present them in their own interface, virtualisation is identical inside them. I’m not aware of any macOS virtualiser on Apple silicon that doesn’t use the API in macOS, and they all share its common limitations and strengths. This also means that, when there’s a bug in virtualisation within macOS, it affects all virtualisers equally.
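To give an idea of what that shared foundation looks like, here's a heavily abbreviated Swift sketch using Apple's Virtualization framework; it omits the platform, storage, display and installation steps a real virtualiser needs, and the sizes are arbitrary:

    import Virtualization

    // Skeleton of a macOS guest configuration. Every macOS virtualiser on
    // Apple silicon builds on these same classes.
    let config = VZVirtualMachineConfiguration()
    config.cpuCount = 4
    config.memorySize = 8 * 1024 * 1024 * 1024      // 8 GB
    config.bootLoader = VZMacOSBootLoader()
    // A complete VM also needs a VZMacPlatformConfiguration (hardware model,
    // machine identifier and auxiliary storage), plus storage, display,
    // network and input devices, all provided as Virtio or Apple-defined devices.

    do {
        try config.validate()
        let vm = VZVirtualMachine(configuration: config)
        print("Ready to start \(vm)")
    } catch {
        print("Configuration not yet complete: \(error)")
    }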

macOS support

Early Virtio support appeared first in macOS Mojave and gathered pace through Catalina and Big Sur, but the first version of macOS to support virtualisation of macOS on Apple silicon Macs was Monterey 12.0. That means that no Mac released after the release of Monterey in the autumn/fall of 2021 will ever be able to run Big Sur, as their hardware isn’t supported by it, and macOS 11 can’t be virtualised. The only way to retain access to Big Sur is to keep an M1 Mac that shipped with it, the last of which was the iMac 24-inch M1 of 2021. However, it also means that the latest M4 Macs can run Monterey in a virtual machine, even though the oldest macOS they can boot into is Sequoia.

When the host or guest macOS is Monterey, sharing folders between them isn’t supported, and the only way to share is through network-based file sharing, which is less convenient. Display support was enhanced in Ventura, which again is required on both guest and host for it to be available.

Support for iCloud and iCloud Drive access didn’t become available in VMs until Sequoia, and now requires that both the guest and host must be running macOS 15.0 or later. As VMs that support these features are structurally different from earlier VMs, this also means those VMs that have been upgraded from an earlier macOS still can’t support iCloud or iCloud Drive. Only those built from the start to support Sequoia on a Sequoia host can support them.

Virtualisation can also have limited forward support, and is widely used to run beta-releases of the next version of macOS. This should be straightforward within the same major version, but testing betas for the next major version commonly requires the installation of additional support software. However, support for running betas is less reliable, and may require creation of a new VM rather than updating.


Many aren’t aware that Apple’s macOS licences do cover its use in VMs, in Section 2B(iii), where there’s a limit of two macOS VMs that can be running on a Mac at any time. This is enforced by macOS, and trying to launch a third will be blocked. For the record, the licence also limits the purposes of virtualisation to “(a) software development; (b) testing during software development; (c) using macOS Server; or (d) personal, non-commercial use.” It’s worth noting that Apple discontinued macOS Server on 21 April 2022, and it’s unsupported for any macOS more recent than Monterey.

Major limitations

The greatest remaining limitation in virtualising macOS on Apple silicon is its complete inability to run apps from the App Store, apart from Apple’s Pages, Numbers and Keynote, when copied across from the host. Even free apps obtained from the App Store can’t be run, although independently distributed apps are likely to be fully supported. This appears to be the result of Apple’s authorisation restrictions, and unless Apple rethinks and reengineers its store policies, it looks unlikely to change.

Some lesser features remain problems. For example, network connections from a VM are always treated as being Ethernet, and there’s no support for them as Wi-Fi. Audio also remains odd, and appears to be only partially supported. Although Sequoia has enabled support for storage devices, earlier macOS was confined to the VM’s disk image and shared folders. Trackpads don’t always work as smoothly as on the host, particularly in older versions of macOS.

Strengths

One of the most important features is full support for running Intel apps using Rosetta 2.

That and other performance is impressive. CPU core-intensive code runs at almost identical speed to that on the host. Geekbench 6 single-core performance is 99% that of the host, although multi-core performance is of course constrained by the number of cores allocated to the VM. Unlike most Intel virtualisers, macOS VMs attain GPU Metal performance only slightly less than the host, with Geekbench 6 Metal down slightly to 94% that of the host.

VMs are mobile between Macs, even when built to support iCloud and iCloud Drive access. Because each VM is effectively self-contained, this is an excellent way to provide access to a customised suite of software with its own settings. As disk images, storage in VMs is normally in sparse file format, so takes a lot less disk space than might be expected. It’s also quick and simple to duplicate a VM for testing, then to delete that duplicate, leaving the original untouched.

Future

Virtualising macOS on Apple silicon has relatively limited value at present, but in the future will become an essential feature for more users. Currently it’s most popular with developers who need to test against multiple versions of macOS, and with researchers, particularly in security, who can lock a VM down with its security protection disabled.

At present, few of the apps that ran in Big Sur or Monterey are incompatible with Sequoia. As macOS is upgraded and newer Macs are released, that will change, and virtualisation will become the only way of running those apps, much as virtualisation on Intel Macs already is for older macOS.

There will also come a time when Apple discontinues support for Rosetta 2 in the current macOS. When that happens, virtualisation will become the only way to run Intel apps on Apple silicon.

However, until App Store apps can run in VMs, for many the future of virtualisation will remain constrained.

Summary

macOS VMs on Apple silicon can:

  • run Monterey and later on any model, but not Big Sur or Intel macOS;
  • run most betas of the next release of macOS;
  • run Intel apps using Rosetta 2;
  • deliver near-normal CPU and GPU performance;
  • access iCloud and iCloud Drive only when both host and guest are running Sequoia or later;
  • but they can’t run any App Store apps except for Pages, Numbers and Keynote.

What are CPU core frequencies in Apple silicon Macs?

By: hoakley
20 January 2025 at 15:30

One of the features of the CPU cores in Apple silicon Macs is that they aren’t run at a single standard frequency or clock speed: macOS varies their frequency according to circumstances. Moreover, those frequencies not only differ between generations, so aren’t the same in M2 chips as in the M1, but they also differ between variants within the same family. This article gives frequencies for each of the chips released to date, and considers how and why they differ. This has only been made possible by the many readers who generously gave their time to provide me with this information: thank you all.

The most reliable method of discovering which frequencies are available is using the command tool powermetrics. This lists frequencies for P and E cores, and this article assumes that those it gives are correct. Although it’s most likely that these frequencies aren’t baked into silicon, so could be changed, I’ve seen no evidence to suggest that Apple has done that in any release Mac.

Frequencies

If powermetrics is to be believed, then the maximum frequencies of each of the CPU cores used in each generation differ from some of those you’ll see quoted elsewhere. Correct values should be:

  • M1 E 2064 MHz or 2.1 GHz; P 3228 MHz or 3.2 GHz;
  • M2 E 2424 MHz or 2.4 GHz; P 3696 MHz or 3.7 GHz;
  • M3 E 2748 MHz or 2.7 GHz; P 4056 MHz or 4.1 GHz;
  • M4 E 2892 MHz or 2.9 GHz; P 4512 MHz or 4.5 GHz.

However, not all variants within a family can use those maximum frequencies. The full table of frequencies reported by powermetrics is:

This is available for download as a Numbers spreadsheet and in CSV format here: mxfreqs

Why those frequencies?

Depending on workload, thread Quality of Service, power mode, and thermal status, macOS sets the frequency for each cluster of CPU cores. Those used range between the minimum or idle, and the maximum, usually given as the core’s ‘clock speed’ and an indication of its maximum potential performance. In between those are as many as 17 intermediate frequencies giving cores great flexibility in performance, power and energy use. Core design and development uses sophisticated models to select idle and maximum frequencies, and undoubtedly to determine those in between.

Looking at the table, it would be easy to assume those numbers are chosen arbitrarily, but when expressed appropriately I think you can see there’s more to them. To look at frequency steps and the frequencies chosen for them, let me explain how I have converted raw frequencies to make them comparable.

First, I work out the steps as evenly spaced points along a line from 0.0, representing idle, to 1.0, representing the core’s maximum frequency. For each of those evenly spaced steps, I calculate a normalised frequency, as
(Fmax − Fstep)/(Fmax − Fidle)
where Fidle is the idle (lowest) frequency value, Fmax is the highest, and Fstep is the actual frequency set for that step.

For example, say a core has an idle frequency of 500 MHz, a maximum of 1,500 MHz, and only one step between those. Its steps will be 0.0, 0.5 and 1.0, and if the relationship is linear, then the frequency set by that intermediate step will be 1,000 MHz. If it’s greater than that, the relationship will be non-linear, tending to higher frequency for that step.
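A couple of lines of Swift make that check concrete, using the hypothetical core above rather than figures from any real chip:

    // Frequency a step would be set to if steps were spread linearly
    // between the idle and maximum frequencies.
    func linearFrequency(step: Double, idle: Double, maximum: Double) -> Double {
        idle + step * (maximum - idle)
    }

    let expected = linearFrequency(step: 0.5, idle: 500, maximum: 1_500)
    print(expected)     // 1000.0 MHz for the middle step
    // An actual frequency above that, say 1,200 MHz, means the relationship
    // is non-linear, favouring higher frequencies for that step.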

I’ll start with the E cores, as they’re simplest and have fewer steps.

E cores

For the M1, Apple didn’t try any tricks with the frequency of its E cores. There are just three intermediate steps, evenly spaced at 0.25, 0.5 and 0.75, and that’s the same with all E cores regardless of variant, from the base up to the Ultra.

With the M2, shown here in red, Apple added an extra step, and in the base M2 there’s also a lower idle frequency, not shown here. What is obvious is that those intermediate frequencies have been increased relative to those of the M1, turning the straight line into a curve.

The M3, shown here in blue, and M4, in purple, deviate even further from the line of the M1, with more steps and relatively higher frequencies.

This shows progress from the M1 in black to the M4 in purple, whose frequencies follow the polynomial shown.

Across the families, intermediate frequencies are most apparent in the E cores, where background threads are run at lower frequencies, and high-QoS threads that should have been run on P cores are run at higher frequencies. In M1 Pro and Max variants, with their two-core E clusters, macOS increases the E cluster frequency when they are running two threads to improve performance and compensate for their small cluster size.

P cores

With P core frequencies, the initial design for the M1 is different. The majority of the frequency steps follow a straight line still, but with a steeper gradient (1.23 against 1.00). Then in the upper quarter of the frequency range, above the step at 0.71, that line eases off to the maximum. This gives finer control of frequency over higher frequencies, and those higher frequencies are also reduced slightly in the base M1 compared to those here from the Pro, Max and Ultra.

In the M2 family, Apple divided frequencies into two: base and Pro variants have two fewer steps, with the base having a lower idle frequency. Shown here in red are those for the M2 Max, which are faired into a polynomial curve. That increases frequencies lower down, reduces them slightly at the upper end, then has a significantly higher maximum frequency.

Apple continued to tweak the P curves in the M3 (blue) and M4 (purple), with increasing numbers of steps but the same finer control at the upper end.

Here’s the comparison between M1 Max and M4 Max, with the same underlying ideas, but substantial differences. In the M4, each of the three variants released so far is different. The base M4 has a lower idle and maximum, the M4 Pro has a higher idle and maximum but one less step between them, and the M4 Max adds another step to the Pro’s series.

Significance

Apple’s engineers have clearly put considerable effort into picking optimised frequencies for each of the families and variants within them. If you still think that this is all fine detail and only the maximum frequencies count, then you might like to ponder why so much care has gone into selecting those intermediate frequencies, and how they’ve changed since the M1. Both P and E cores spend a lot of their time running at these carefully chosen frequencies.
