
Comparing in-core performance of Intel, M3 and M4 CPU cores

By: hoakley
16 May 2025 at 14:30

It has been a long time since I last compared performance between CPU cores in Intel and Apple silicon Macs. This article compares six in-core measures of CPU performance across four different models, two with Intel processors, an M3 Pro, and an M4 Pro.

If you’re interested in comparing performance using mixed code that models what runs in common apps, then look no further than Geekbench. The purpose of my tests isn’t to replicate those, but to gain insight into the CPU cores themselves when running tight number-crunching loops, largely using their registers and accessing memory as little as possible. This set of tests lays emphasis on those run at low Quality of Service (QoS), thus on the E cores of Apple silicon chips. Although those cores run relatively little user code, they are responsible for much of the background processing performed by macOS, and can also run threads at high QoS when there are no free P cores available, although they then do so at higher frequencies to deliver better performance.

Methods

Testing was performed on four Macs:

  • iMac Pro 2017, 3.2 GHz 8-core Intel Xeon W, 32 GB memory, Sequoia 15.3.2;
  • MacBook Pro 16-inch 2019, 2.3 GHz 8-core Intel Core i9, 16 GB memory, Sequoia 15.5;
  • MacBook Pro 16-inch 2023, M3 Pro, 36 GB memory, Sequoia 15.5;
  • Mac mini 2024, M4 Pro, 48 GB memory, Sequoia 15.5.

Six test subroutines were used in a GUI harness, as described in many of my previous articles. Normally, those include tests I have coded in Arm Assembly language, but for cross-platform comparisons I rely on the following coded in Swift:

  • float mmul, direct calculation of 16 x 16 matrix multiplication using nested for loops on Floats;
  • integer dot product, direct calculation of vector dot product on vectors of 4 Ints;
  • simd_float4 calculation of the dot product using simd_dot in the Accelerate library;
  • vDSP_mmul, a function from the vDSP sub-library in Accelerate, which multiplies two 16 x 16 32-bit floating point matrices, and in M1 and M3 chips appears to use the AMX co-processor;
  • SparseMultiply, a function from Accelerate’s Sparse Solvers, which multiplies a sparse and a dense matrix, and may use the AMX co-processor in M1 and M3 chips;
  • BNNSMatMul, matrix multiplication of 32-bit floating-point numbers, also in the Accelerate library, and since deprecated.

Source code for the last four is given in the appendix to this article.
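
To give a flavour of what these involve, here’s a minimal sketch in Swift of the first two tests, the nested-loop matrix multiplication and the simd_dot dot product. It isn’t the test code itself, and the function names are illustrative, but it shows the kind of tight, register-bound loop being measured.

import simd

// Illustrative sketch: 16 x 16 matrix multiplication on Floats using nested for loops,
// with matrices stored row-major in flat arrays.
func floatMMul(_ a: [Float], _ b: [Float]) -> [Float] {
    var c = [Float](repeating: 0, count: 256)
    for i in 0..<16 {
        for j in 0..<16 {
            var sum: Float = 0
            for k in 0..<16 {
                sum += a[i * 16 + k] * b[k * 16 + j]
            }
            c[i * 16 + j] = sum
        }
    }
    return c
}

// Illustrative sketch: dot product of two 4-element vectors using simd_dot.
func simdDotProduct(_ x: simd_float4, _ y: simd_float4) -> Float {
    return simd_dot(x, y)
}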

Each test was run first in a single thread, then in four threads simultaneously. Loop throughput per second was calculated from the average time taken for each of the four threads to complete, and compared against the single thread to ensure it was representative. Results are expressed as percentages compared to test throughput at high QoS on the iMac Pro set at 100%. Thus a test result reported here as 200% indicates the cores being tested completed calculations in loops at twice the rate of those in the cores of the iMac Pro, so are ‘twice the speed’.

High QoS

User threads are normally run at high QoS, so they get the best performance available from the CPU cores. In Apple silicon chips, those threads are run preferentially on P cores at high frequency, although that may not be at the core’s maximum. Results are charted below.

Each cluster of bars here shows loop throughput for one test relative to the iMac Pro’s 3.2 GHz 8-core Xeon processor at 100%. Pale blue and red bars are for the two Intel Macs, the M3 Pro is dark blue, and the M4 Pro green. The first three tests demonstrate what was expected, with an increase in performance in the M3 Pro, and even more in the M4 Pro to reach about 200%.

Results from vDSP matrix multiplication are different, with less of an increase in the M3 Pro, and a reduction in the M4 Pro. This may reflect issues in the code used in the Accelerate library. That contrasts with the huge increases in performance seen in the last two tests, rising to a peak of over 400% in BNNS matrix multiplication.

With that single exception, P cores in recent Apple silicon chips are out-performing Intel CPU cores by wider margins than can be accounted for in terms of frequency alone.

Low QoS

When expressed relative to loop throughput at high QoS, no clear trend emerges in Apple silicon chips. This reflects the differences in handling of threads run at low QoS: as the Intel CPUs used in Macs only have a single core type, they can only run low QoS threads at lower priority on the same cores. In Apple silicon chips, low QoS threads are run exclusively on E cores running at frequencies substantially lower than their maximum, for energy efficiency. This is reflected in the chart below.

In the Intel Xeon W of the iMac Pro, low QoS threads are run at a fairly uniform throughput of about 45% that of high QoS threads, and in the Intel Core i9 that percentage is even lower, at around 35%. Throughput in Apple silicon E cores is more variable, and in the case of the last test, the E cores in the M4 Pro reach 66% of the throughput of the Intel Xeon at high QoS. Thus, Apple appears to have chosen the frequencies used to run low QoS threads in the E cores to deliver the required economy rather than a set level of performance.

Conclusions

  • CPU P core performance in M3 and M4 chips is generally far superior to CPUs in late Intel Macs.
  • Performance in M3 P cores is typically 160% that of a Xeon or i9 core, rising to 330%.
  • Performance in M4 P cores is typically 190% that of a Xeon or i9 core, rising to 400%.
  • Performance in E cores when running low QoS threads is more variable, and typically around 30% that of a Xeon or i9 core at high QoS, to achieve superior economy in energy use.
  • On Intel processors running macOS Sequoia, low QoS threads are run significantly slower than high QoS threads, at about 45% (Xeon) or 30-35% (i9).

My apologies for omitting legends from the first version of the two charts, and thanks to @holabotaz for drawing my attention to that error, now corrected.

What is Quality of Service, and how does it matter?

By: hoakley
9 May 2025 at 14:30

In computing, the term Quality of Service is widely used to refer to communication and network performance, but for Macs it has another more significant meaning, as the property that determines the performance of each thread run on your Mac, most importantly in Apple silicon chips.

Processes and threads

Each process running on your Mac consists of at least one thread. Threads are single flows of code execution run on one CPU core at a time, sharing virtual memory allocated to that process, but with their own stack. In addition to the process’s main thread, it can create additional threads as it requires, which can then be scheduled to run in parallel on different cores. As all recent Macs have more than one core, processes with more than one thread can make good use of more than one core, and so run faster.

Take the example of a file compressor. If it’s coded so that it can perform its compression in four threads that can be run simultaneously, then it will compress files in roughly a quarter of the time when it runs on four CPU cores, compared with running on a single core (ignoring input and output to disk).

That only works when those four cores are all free. If your Mac is also trying to build its Spotlight indexes at the same time, the threads doing that will compete with those of your compression app. That’s where the thread’s Quality of Service (QoS) settings come in, as they assign priority. On Apple silicon Macs, a thread’s QoS will also help determine whether it’s run on its Performance or Efficiency cores.

Standard QoS settings

QoS is set by the process, and is normally chosen from the standard list:

  • QoS 9 (binary 001001), named background and intended for threads performing maintenance, which don’t need to be run with any higher priority.
  • QoS 17 (binary 010001), utility, for tasks the user doesn’t track actively.
  • QoS 25 (binary 011001), userInitiated, for tasks that the user needs to complete to be able to use the app.
  • QoS 33 (binary 100001), userInteractive, for user-interactive tasks, such as handling events and the app’s interface.

There’s also a ‘default’ value of QoS between 17 and 25, an unspecified value, and in some circumstances you might come across others used by macOS.

These are the QoS values exposed to the programmer. Internally, macOS uses a more complex scheme with different values.
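
As a minimal sketch of how these appear to a programmer, the following Swift assigns QoS through Dispatch and OperationQueue; the two tasks are hypothetical stand-ins for real work.

import Foundation

// Maintenance work at background QoS (9): kept on the E cores of Apple silicon chips
DispatchQueue.global(qos: .background).async {
    print("rebuilding index…")        // hypothetical maintenance task
}

// User-facing work at userInitiated QoS (25): preferentially run on the P cores
DispatchQueue.global(qos: .userInitiated).async {
    print("rendering preview…")       // hypothetical user-facing task
}

// The same choice can be made for a whole OperationQueue
let workQueue = OperationQueue()
workQueue.qualityOfService = .utility    // QoS 17, for tasks the user doesn’t track actively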

CPU core type

When running apps on Intel Macs, because all their CPU cores are identical, QoS has more limited effect, and is largely used to determine priority when there are threads queued for execution on a limited number of cores.

Apple silicon Macs are completely different, as they have two types of CPU core, Efficiency (E) cores designed to use less energy and normally run at lower frequencies, and Performance (P) cores that can run at higher frequencies and deliver maximum performance, but using more energy.

QoS is therefore used to determine which type of core a thread should be run on. Threads with a QoS of 9 (background) are run on E cores, and can’t be promoted to run on P cores, even when there are inactive P cores and the E cores are heavily loaded. Threads with a QoS of 17 and above will be preferentially run on P cores when they’re available, but when they’re all fully occupied, macOS will run them on E cores instead. In that case, the E cores will be run at higher frequencies for better performance with less economy.

If your Apple silicon Mac has a base variant chip with 4 E and 4 P cores, this results in the following:

  • apps with a total of up to 4 threads at high QoS will be scheduled and run at full speed on the P cores;
  • when those P cores are all busy with high QoS threads, running another thread will then result in that being run on the E cores, and slightly slower than it would on a P core;
  • a total of 8 high QoS threads can thus be run on P and E cores together;
  • when running low QoS background threads on E cores, a maximum of 4 can be run at any time when the E cores are available, but those threads can’t spill over and run on the P cores, even if those are idle.
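
A small sketch like the following, with spin() as a hypothetical stand-in for any CPU-bound loop, makes that behaviour visible in Activity Monitor’s CPU History window on such a chip.

import Foundation

func spin() {                        // stand-in for a tight number-crunching loop
    var total = 0.0
    for i in 1...200_000_000 { total += Double(i).squareRoot() }
    print(total)
}

// Eight high QoS threads: four saturate the P cores, the rest spill onto the E cores
for _ in 0..<8 {
    DispatchQueue.global(qos: .userInitiated).async { spin() }
}

// Four background threads: confined to the E cores, never promoted to the P cores
for _ in 0..<4 {
    DispatchQueue.global(qos: .background).async { spin() }
}

// Keep the process alive long enough to watch the cores in Activity Monitor
Thread.sleep(forTimeInterval: 60)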

Controls

As QoS is normally either set by the process for its threads, or for services in their LaunchDaemon or LaunchAgent property list, the user has little direct control. A few apps now provide settings to adjust the QoS of their worker threads. Among them are the compression utility Keka, together with a couple of my own utilities such as the Dintch integrity checker.


In Keka’s settings, you can give its tasks a maximum number of threads, and even run them at custom Quality of Service (QoS) if you want them to be run in the background on E cores, and not interrupt your work on P cores.


Dintch has a simple slider, with the green tortoise to run it on E cores alone, and the red racing car at full speed on the P cores.

App Tamer and taskpolicy

The great majority of threads run at low QoS on the E cores are those of macOS and its services like Spotlight indexing. When a thread has already been assigned a low QoS, there’s currently no utility or tool that can promote it so it’s run at a higher QoS. In practice this means that you can’t accelerate those tasks.

What you can do, though, is demote threads with higher QoS to run at low QoS, more slowly and in the background. The best way to do this is using St. Clair Software’s excellent utility App Tamer. If you prefer, you can use the taskpolicy command tool instead. For instance, the command
taskpolicy -b -p 567
will confine all threads of the process with PID 567 to the E cluster, and can be reversed using the -B option for threads with higher QoS (but not those set to low QoS by the process).


That can be seen in this CPU History window from Activity Monitor. An app has run four threads, two at low QoS and two at high QoS. On the left side of each core trace they are run on their respective cores, as set by their QoS. The app’s process was then changed using taskpolicy -b and the threads run again, as seen on the right. The two threads with high QoS are then run together with the two at low QoS on the four E cores alone.

Virtualisation

Although Game Mode does alter the effects of QoS and core allocation, its impact is limited. The one significant exception to the way that QoS works is in virtualisation.

macOS Virtual Machines running on Apple silicon chips are automatically assigned a high QoS, and run preferentially on P cores. Thus, even when the VM runs its threads at low QoS, those are run within threads on the host’s P cores. This remains the only known method of electively running low QoS threads on P cores.

Key points

  • Threads are single flows of code execution run on one CPU core at a time, sharing virtual memory allocated to that process, but with their own stack.
  • Apps and processes set the Quality of Service (QoS) for each of the threads they run.
  • On Apple silicon chips, low QoS of background results in that thread being run on E cores alone.
  • Higher QoS threads are preferentially allocated to P cores, but when they aren’t available, that thread will be run on E cores at high frequency.
  • Some apps now provide controls over the QoS of their worker threads.
  • App Tamer and taskpolicy let you demote high QoS threads to be run with low QoS on the E cores, but can’t promote low QoS threads to run faster on P cores.
  • Virtual machines run all threads at high QoS as far as the host Mac is concerned.

Further reading

Apple’s Energy Efficiency Guide for Mac Apps, last revised 13 September 2016, so without any mention of Apple silicon.
Apple silicon: 1 Cores, clusters and performance
Apple silicon: 2 Power and thermal glory
Apple silicon: 3 But does it save energy?

Take control of disks using APFS

By: hoakley
18 March 2025 at 15:30

If you want a quiet life, just format each external disk in APFS (with or without encryption), and cruise along with plenty of free space on it. For those who need to do different, either getting best performance from a hard disk, or coping with less free space on an SSD, here are some tips that might help.

File system basics

A file system like APFS provides a logical structure for each disk. At the top level it’s divided into one or more partitions that in APFS also serve as containers for its volumes. Partitions are of fixed size, although you can always repartition a disk, a process that macOS will try to perform without losing any of the data in its existing partitions. That isn’t always possible, though: if your 1 TB disk already contains 750 GB, then repartitioning it into two containers of 500 GB each will inevitably lose at least 250 GB of existing data.

All APFS volumes within any given container share the same disk space, and by default each can expand to fill that. However, volumes can also have size limits imposed on them when they’re created. Those can reserve a minimum size for that volume, or limit it to a maximum quota size.

How that logical structure is implemented in terms of physical disk space depends on the storage medium used.

Faster hard disks

Hard disks store data in circular tracks of magnetic material. Storing each file requires multiple sectors of those tracks, each of which can contain 512 or 4096 bytes. Tracks grow longer (in circumference) as you move away from the centre of the platter towards its edge, yet the disk spins at a constant number of revolutions per minute (its angular velocity is constant). Because sectors occupy a roughly constant length of track, each sector at the periphery of the disk passes under the heads in less time than one closer to the centre of the platter. The result is that read and write performance varies according to where files are stored on the disk: they’re faster the further they are from the centre.


This graph shows how read and write speeds change in a typical compact external 2 TB hard disk as data is stored towards the centre of the disk. At the left, the outer third of the disk delivers in excess of 130 MB/s, while the inner third at the right delivers less than 100 MB/s.

You can use this to your advantage. Although you don’t control exactly where file data is stored on a hard disk, you can influence that. Disks normally fill with data from the periphery inwards, so files written first to an otherwise empty disk will normally be written and read faster.

You can help that on a more permanent basis by dividing the disk into two or more partitions (APFS containers), as the first will normally be allocated space on the disk nearest the periphery, so read and write faster than later partitions added nearer the centre. Adding a second container or partition of 20% of the total capacity of the disk won’t cost you much space, but it will ensure that performance doesn’t drop off to little more than half that achieved in the most peripheral 10%.

Reserving free space on SSDs

The firmware in an SSD knows nothing of its logical structure, and for key features manages the whole of its storage as a single unit. Wear levelling ensures that individual blocks of memory undergo similar numbers of erase-write cycles, so they age evenly. Consumer SSDs that use dynamic SLC write caches allocate those from across the whole of storage, not confined to particular partitions. You can thus manage free space to keep sufficient dynamic cache available at the level of the whole disk.

One approach is to partition the SSD to reserve a whole container, with its fixed size, to support the needs of the dynamic cache. An alternative is to use volume reserve and quota sizes for the same purpose, within a single container. For example, in a 1 TB SSD with a 100 GB SLC write cache you could either:

  • with a single volume, set its quota to 900 GB, or
  • add an empty volume with its reserve size set to 100 GB.

Which of these you choose comes down to personal preference, although on a boot volume group you won’t be able to set a quota for the Data volume, so the most practical solution for a boot disk is to add an empty volume with a specified reserve size.

To do this when creating a new volume, click on the Size Options… button and set the quota or reserve.

Summary

  • Partition hard disks so that you only use the fastest 80% or so of the disk.
  • To reserve space in an SSD for dynamic caching, you can add a second APFS container.
  • A simpler and more flexible way to reserve space on SSDs is setting a quota size for a single volume, or adding an empty volume with a reserve size.
  • Size options can currently only be set when creating a volume.

Why SSDs slow down, and how to avoid it

By: hoakley
17 March 2025 at 15:30

Fast SSDs aren’t always fast when writing to them. Even an Apple silicon Mac’s internal SSD can slow alarmingly in the wrong circumstances, as some have recently been keen to demonstrate. This article explains why an expensive SSD normally capable of better than 2.5 GB/s write speed might disappoint, and what you can do to avoid that.

In normal use, there are three potential causes of reduced write speed in an otherwise healthy SSD:

  • thermal throttling,
  • SLC write cache depletion,
  • the need for Trimming and/or housekeeping.

Each of those should only affect write speed, leaving read speed unaffected.

Thermal throttling

Writing data to an SSD generates heat, and writing a lot can cause it to heat up significantly. Internal temperature is monitored by the firmware in the SSD, and when that rises sufficiently, writing to it will be throttled back to a lower speed to stabilise temperature. Some SSDs have proved particularly prone to thermal throttling, among them older versions of the Samsung X5, one of the first full-speed Thunderbolt 3 SSDs.

In testing, thermal throttling can be hard to distinguish from SLC write cache depletion, although thorough tests should reveal its dependence on temperature rather than the mere quantity of data written.

The only solution to thermal throttling is adequate cooling of the SSD. Internal SSDs in Macs with active cooling using fans shouldn’t heat up sufficiently to throttle, provided their air ducts are kept free and they’re used in normal ambient temperatures. Well-designed external enclosures should ensure sufficient cooling using deep fins, although active cooling using small fans remains more controversial.

SLC write cache

To achieve their high storage density, almost all consumer-grade SSDs store multiple bits in each of their memory cells, and most recent products store three in Triple-Level Cell or TLC. Writing all three bits to a single cell takes longer than it would to write them to separate cells, so most TLC SSDs compensate by using caches. Almost all feature a smaller static cache of up to 16 GB, used when writing small amounts of data, and a more substantial dynamic cache borrowed from main storage cells by writing single bits to them as if they were SLC (single-level cell) rather than TLC.

This SLC write cache becomes important when writing large amounts of data to the SSD, as the size of the SLC write cache then determines overall performance. In practice, its size ranges from around 2.5% of total SSD capacity to over 10%. This can’t be measured directly, but can be inferred from measuring speed when writing more data than can be contained in the cache. As it can’t be emptied during full-speed write, once the dynamic cache is full, write speed suddenly falls; for example a Thunderbolt 5 SSD with full-speed write of 5.5 GB/s might fall to 1.4 GB/s when its SLC write cache is full. This is seen in both external and internal SSDs.

To understand the importance of SLC write cache in determining performance, take this real-world example:

  • 100 GB is written to a Thunderbolt 5 SSD with an SLC write cache of 50 GB. Although the first half of the 100 GB is written at 5.5 GB/s, the remaining 50 GB is written at 1.4 GB/s because the cache is full. Total time for the whole write is then 44.8 seconds.
  • Performing the same write to a USB4 SSD with an SLC write cache in excess of 100 GB proceeds at a slower maximum rate of 3.7 GB/s, but that’s sustained for the whole 100 GB, which then takes only 27 seconds, 60% of the time of the ‘faster’ SSD.
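
The arithmetic behind those two figures is straightforward, as this short Swift sketch shows (sizes in GB, speeds in GB/s, taken from the example above).

// Thunderbolt 5 SSD: 50 GB at full speed, then 50 GB once its 50 GB cache is full
let tb5Time = 50.0 / 5.5 + 50.0 / 1.4      // ≈ 9.1 s + 35.7 s ≈ 44.8 s

// USB4 SSD: cache larger than 100 GB, so the whole write runs at full interface speed
let usb4Time = 100.0 / 3.7                 // ≈ 27.0 s, about 60% of the TB5 time

print(tb5Time, usb4Time)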

To predict the effect of SLC write cache size on write performance, you therefore need to know cache size, out-of-cache write speed, and the time required to empty a full cache between writes. I have looked at these on two different SSDs: a recent 2 TB model with a Thunderbolt 5 interface, and a self-assembled USB4 OWC 1M2 enclosure containing a Samsung 990 Pro 2 TB SSD. Other enclosures and SSDs will differ, of course.

The TB5 SSD has a 50 GB SLC write cache, as declared by the vendor and confirmed by testing. With that cache available, write speed is 5.5 GB/s over a TB5 interface, but falls to 1.4 GB/s once the cache is full. It then takes 4 minutes for the cache to be emptied and made available for re-use, allowing write speeds to reach 5.5 GB/s again.

The USB4 SSD has an SLC write cache in excess of 212 GB, as demonstrated by writing a total of 212 GB at its full interface speed of 3.7 GB/s. As the underlying performance of that SSD is claimed to exceed that required to support TB5, putting that SSD in a TB5 enclosure should enable it to comfortably outperform the other SSD.

Two further factors could affect SLC write cache: partitioning and free space.

When you partition a hard disk, that affects the physical layout of data on the disk, a feature sometimes used to ensure that data only uses the outer tracks where reads and writes are fastest. That doesn’t work for SSDs, where the firmware manages storage use and won’t normally segregate partitions physically. As a result, partitioning into APFS containers doesn’t affect the SLC write cache, either in terms of size or performance.

Free space can be extremely important, though. SLC write cache can only use storage that’s not already in use, and if necessary has been erased ready to be re-used. If the SSD only has 100 GB free, then that can’t all be used for cache, so limiting the size that’s available. This is another good reason for the performance of SSDs to suffer when they have little free space available.

Ultimately, to attain high write speeds through SLC write cache, you have to understand the limits of that cache and to work within them. One potential method for effectively doubling the size of that cache might be to use two SSDs in RAID-0, although that opens further questions.

Trim and housekeeping

In principle, Trim appears simple. For example, Wikipedia states: “The TRIM command enables an operating system to notify the SSD of pages which no longer contain valid data. For a file deletion operation, the operating system will mark the file’s sectors as free for new data, then send a TRIM command to the SSD.” A similar explanation is given by vendors like Seagate: “SSD TRIM is a command that optimizes SSDs by informing them which data blocks are no longer in use and can be wiped. When files are deleted, the operating system sends a TRIM command, marking these blocks as free for reuse.”

This rapidly becomes more complicated, though. For a start, the TRIM command for SATA doesn’t exist for NVMe, used by faster SSDs, where its closest substitute is DEALLOCATE. Neither is normally reported in the macOS log, although APFS does report its initial Trim when mounting an SSD. That’s reported for each container, not volume.

What we do know from often bitter experience is that some SSDs progressively slow down with use, a phenomenon most commonly (perhaps only?) seen with SATA drives connected over USB. Those also don’t get an initial Trim by APFS when they’re mounted.

It’s almost impossible to assess whether time required for Trim and housekeeping is likely to have any adverse effect on SSD write speed, provided that sufficient free disk space is maintained to support full-speed writing to the SLC write cache. Neither does there appear to be any need for a container to be remounted to trigger any Trim or housekeeping required to erase deleted storage ready for re-use, provided that macOS considers that SSD supports Trimming.

Getting best write performance from an SSD

  • Avoid thermal throttling by keeping the SSD’s temperature controlled. For internal SSDs that needs active cooling by fans; for external SSDs that needs good enclosure design with cooling fins or possibly a fan.
  • Keep ample free space on the SSD so the whole of its SLC write cache can be used.
  • Limit continuous writes to within the SSD’s SLC write cache size, then allow sufficient time for the cache to empty before writing any more.
  • It may be faster to use an SSD with a larger SLC write cache over a slower interface, than one with a smaller cache over a faster interface.
  • Avoid SATA SSDs.

I’m grateful to Barry for raising these issues.
