
Inside M4 chips: CPU core management

By: hoakley
5 December 2024 at 15:30

Whether you’re a developer or user, gaining an understanding of how macOS manages the cores in an Apple silicon CPU is important. It explains what you will see in action when you open Activity Monitor, how your apps get to deliver optimal performance, and why you can’t speed up background tasks like Time Machine backups. In this series (links below) I’ve been trying to piece this together for the M4 family, and this article is my first attempt to summarise as much of the story as I know so far. I therefore welcome your comments, counter-arguments and improvements.

Scope

For the purposes of this article, I’ll consider a single thread that macOS is ready to load onto a CPU core for execution. For that to happen, five decisions are to be made:

  • which type of core, P or E,
  • which cluster to run it in,
  • which core within that cluster,
  • what frequency to run that cluster at,
  • the mobility of that thread between cores in the same cluster, and between clusters (when available).

Which type of core?

Since the early days of analysing M1 CPUs, it has been clear that the choice between P and E core types is based on the Quality of Service (QoS) assigned to the thread, the availability of a core of that type, and whether the thread uses a co-processor such as the AMX. For the sake of generality and simplicity, I’ll here ignore the last of those, and consider only threads that are executed by the CPU alone.

QoS is primarily set by the process owning that thread, and is the setting exposed to the programmer and user, although internally QoS is modulated by other factors including the thermal environment. Threads assigned a QoS of 9 or less, designated Background, are allocated exclusively to E cores, while those with a higher, ‘user’ level of QoS are preferentially allocated to P cores. When no P core is available, they may be allocated to E cores instead.

Core allocation can be changed on the fly, reassigning higher QoS threads to run on E cores, but not the other way around: low QoS threads can’t be promoted to run on P cores. The sole exception is code running inside a virtual machine, where the VM’s virtual cores are given high QoS, allowing low QoS threads within the VM to benefit from the speed of P cores.
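
For developers, the lever here is the QoS assigned when work is dispatched. Below is a minimal Swift sketch of that, assuming Grand Central Dispatch is used; the arithmetic loops are just stand-in workloads, and macOS remains free to place and move the resulting threads as described above.

import Foundation

// Background QoS (QoS 9): macOS confines these threads to E cores.
DispatchQueue.global(qos: .background).async {
    var x = 0.0
    for i in 1...5_000_000 { x += Double(i).squareRoot() }   // stand-in workload
    print("background thread done:", x)
}

// User-initiated QoS: preferentially allocated to P cores, spilling over to
// E cores only when no P core is available.
DispatchQueue.global(qos: .userInitiated).async {
    var y = 0.0
    for i in 1...5_000_000 { y += Double(i).squareRoot() }
    print("user-initiated thread done:", y)
}

// Keep a command-line process alive long enough for both threads to finish.
Thread.sleep(forTimeInterval: 2)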

Which cluster?

M4 Pro and Max variants have two clusters of P cores, so the next decision is which of those to run a higher QoS thread on:

  • if both clusters are shut down, one will be chosen and its frequency brought up;
  • if one cluster is already running and has sufficient idle residency (‘available residency’) to accommodate the thread, that will be chosen;
  • if one cluster already has full active residency, the other cluster will be chosen;
  • if both clusters already have full active residency, then the thread will be allocated to the E cluster instead.

This fills an active P cluster before allocating threads to the inactive one, and fills both P clusters before allocating a higher QoS thread to the E cluster.
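
Expressed as pseudocode, those rules might look something like the following Swift sketch. This is purely an illustration of the ordering described above, not any real macOS interface, and adding a flat 100% of active residency per thread is a simplification.

// Illustrative sketch of cluster choice for a higher QoS thread on an M4 Pro/Max.
struct PCluster {
    var isShutDown: Bool
    var activeResidency: Double            // total %, up to capacity
    let capacity: Double                   // 500% for a five-core cluster
    var hasAvailableResidency: Bool { !isShutDown && activeResidency < capacity }
}

enum Placement { case pCluster(index: Int), eCluster }

func placeHighQoSThread(in clusters: inout [PCluster]) -> Placement {
    // Fill a cluster that's already running and has available residency first.
    if let i = clusters.firstIndex(where: { $0.hasAvailableResidency }) {
        clusters[i].activeResidency += 100
        return .pCluster(index: i)
    }
    // Otherwise bring up a cluster that's currently shut down.
    if let i = clusters.firstIndex(where: { $0.isShutDown }) {
        clusters[i].isShutDown = false     // its frequency would be brought up here
        clusters[i].activeResidency = 100
        return .pCluster(index: i)
    }
    // Both P clusters already have full active residency: spill over to the E cluster.
    return .eCluster
}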

[Image: m4coremanagement1]

Cluster allocation is of course simpler on the base M4, where there’s only one E and one P cluster. Should Apple release an M4 Ultra as expected, that would have not only four P clusters but also two E clusters. Until that happens, we can only speculate that similar rules would apply, extended to cover the two E clusters.

Which core within the cluster?

This is perhaps the simplest decision to make. If there’s only one core with available residency in that cluster, that’s the only choice. Otherwise macOS picks an arbitrary core from those available, apparently to ensure roughly even use of cores within each cluster.

What frequency?

For the E cluster, the choice of frequency appears straightforward: low QoS threads are run at the minimum E core frequency of 1,020 MHz or slightly higher, while higher QoS threads that have spilt over from fully occupied P clusters are run at the E core maximum frequency of 2,592 MHz.

P cluster frequency appears to be determined by the total active residency of that cluster after the new thread has been added. When that’s the only thread running in that cluster, maximum frequency of 4,512 MHz is chosen, but rising total active residency reduces that in steps down to about 3,852 MHz when all the cores in the cluster are at 100% active residency. In most cases, the big reduction in frequency occurs when going from about 200% to 300% total active residency. This currently appears to be part of a strategy to pre-emptively minimise the risk of thermal stress within the chip.
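
Pulling together measurements reported elsewhere in this series, that behaviour can be sketched as a simple step function. The breakpoints below are my own reading of those measurements (about 4,511 MHz at 100% total active residency, 4,415 MHz at 200%, 3,924 MHz at 300%, and roughly 3,852 MHz with the whole cluster fully active), not documented behaviour.

// Approximate M4 P-cluster frequency (MHz) for a given total active residency (%),
// as observed in these tests; intermediate values are rounded to the nearest step.
func observedPClusterFrequency(totalActiveResidency r: Double) -> Double {
    switch r {
    case ..<150.0: return 4511      // a single thread runs close to core maximum
    case ..<250.0: return 4415      // two threads
    case ..<350.0: return 3924      // three threads: the largest step down (~87% of maximum)
    default:       return 3852      // cluster fully active
    }
}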

[Image: m4coremanagement2]

Thread mobility

Once the thread has been loaded into a core in the optimal cluster running at the chosen frequency, it’s likely to be moved periodically, both to any other free core within that cluster, and to another cluster, when available. While this does occur in previous M-series CPUs, it appears particularly prominent in M4 variants.

Movement of threads within a cluster can occur quite frequently, every 0.1 second or so, particularly within the E cluster. Movement between clusters occurs less frequently, about every 4-5 seconds, and would only occur when the other cluster is shut down or idle, so free to run all the threads of the current cluster. This is most probably to ensure even thermal conditions within the chip.

Summary

The whole strategy is shown in the following diagram, which is also available to download as a tear-out PDF: m4coremanagement1

[Image: m4coremanagement3]

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Inside M4 chips: Matrix processing and Power Modes
Inside M4 chips: Controlling frequency

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

How is Thunderbolt 5 doing so far?

By: hoakley
4 December 2024 at 15:30

When Apple launched its first M4 Macs just over a month ago, I was surprised that models with M4 Pro or Max chips offered Thunderbolt 5. Although there are still relatively few computers in use with support for TB5, and a dearth of peripherals, this article summarises early experience with this exotic new bus.

What TB5 peripherals are available?

As far as I’m aware, as of today there’s only one Thunderbolt 5 peripheral shipping in quantity, the Kensington SD5000T5 Thunderbolt 5 Triple Docking Station, with a total of three downstream TB5 ports. I’m just completing a full review of this, due to appear in MacFormat and MacLife magazines early next year.

SSDs have been announced by OWC, in its Envoy Ultra, and by Sabrent. The first of OWC’s have apparently started to ship, although they aren’t expected to become readily available until the New Year, when Sabrent’s are also expected.

OWC has also announced a TB5 hub, but that’s unlikely to appear until next year.

Several PCs are now available with TB5 support, although that seems to be fiddly to configure in Windows. Among those is the Razer Blade 18, which is even more expensive than a MacBook Pro with an M4 Max.

Other than those, there are lots of expensive TB5 cables, but precious little to connect to the other end.

Multiple displays

Many of those rushing to buy into TB5 are doing so because of its promised support for multiple displays. For example, the Kensington dock claims to support up to three 6K displays at 60 Hz with the M4 Max, and two with the M4 Pro. Although I have been unable to test those combinations, there are already reports that the M4 Max works well with three displays connected direct, but only two of those work when using the Kensington dock.

This has apparently taken Kensington and Apple by surprise, but until this has been addressed, I wouldn’t assume that you’ll be able to use all three displays attached to the dock.

Multiple SSDs

In early 2023, when TB4 hubs were becoming available, I wrote a whole series of articles here analysing their performance with a range of different SSDs. Links to those are given at the end. Those predated OWC’s superb Express 1M2 USB4 enclosure that now offers consistent and reliable performance for Apple silicon Macs, but not Intel models, which unfortunately lack support for USB4.

I have recently been revisiting SSD performance, both directly connected to my Mac mini M4 Pro, and working through the Kensington dock. Although some results are impressive, there are others that shock.

[Image: sysinfotb5]

As shown in System Information, this dock connects to the host at 80 Gb/s, and to each USB4 drive at the expected 40 Gb/s.

On the bright side, 1M2 enclosures that return direct read/write speeds of 3.7/3.7 GB/s read almost as fast when attached through the dock, but their write speed drops to 2.3 GB/s, similar to many TB3 SSDs attached directly. You can even connect a USB4 drive to each of the dock’s three TB5 ports to benchmark them simultaneously, and get read/write results of 2.1/2.1 GB/s on each of them. That performance represents the maximum total data transfer capacity, matching claims of 6 GB/s made of TB5 SSDs, and equating to 80 Gb/s in TB5/USB4v2 symmetric mode.

[Image: tb5tests]

Results from TB3 SSDs are more worrying. An award-winning certified Thunderbolt 3 SSD that achieves 2.9/2.2 GB/s read/write when attached direct maintained a good read speed through the dock, at 2.8 GB/s, but almost ground to a halt during the write test, at 422 MB/s; that’s roughly the speed you’d expect from a basic SATA SSD.

You can read similar experiences during early testing of this dock for PC World.

For the time being, TB5 performs well with USB4 and directly connected TB3 SSDs, and the dock is a good solution for those wanting high-speed access to two or three USB4 SSDs in OWC Express 1M2 enclosures. The dock does have serious problems when writing to TB3 SSDs, though, where it may fall far short of expectations. Hopefully these problems will be resolved early next year.

Recommendations

  • Although TB5 promises much, initial tests show that it currently has problems delivering on those promises.
  • Reports indicate that it may not yet support M4 Max chips driving three 6K displays at 60 Hz from a TB5 dock.
  • Performance claimed for TB5 SSDs has not yet been confirmed in independent tests.
  • Performance of TB3 SSDs attached to a TB5 dock demonstrates some very poor write speeds.
  • Performance of USB4 SSDs attached to a TB5 dock demonstrates better and more consistent results, although their write speed also falls.
  • TB5 cables and peripherals are expensive.
  • Thunderbolt 5 is still at an experimental stage, and may take some time before it realises its potential.

Thunderbolt performance and TB4 hubs

General hub performance
Write speed throttling
How faster SSDs can impair performance of slower ones
Three SSDs on one hub
Getting best performance from Thunderbolt on Apple silicon Macs: a practical guide

Testing with Stibium

When using the ‘gold standard’ method of testing storage using my free Stibium, you don’t normally need to restart the Mac between write and read speed measurements. This has changed with the Mac mini M4 Pro, at least. If you go straight on to measure read speeds, results will be bogus because of what appears to be extensive caching of the files written during the previous write test. That results in absurdly high read speeds of more than 6 GB/s in most cases. This is surprising, as a total of just over 53 GB of files are written during the full write test, which seems far more than macOS should ever cache successfully!

For these tests on external SSDs, I therefore quit Stibium after measuring write speed, unmount the volume tested, remount it in Disk Utility, and open Stibium again to perform the read tests. This apparently clears caches reliably, and read speeds are consistent and in accord with those expected.

Interests

I bought my own Mac mini M4 Pro, Kensington SD5000T5 Thunderbolt 5 Triple Docking Station, and all the OWC Express 1M2 enclosures and their SSDs, at their regular retail prices. The only product tested here that has been provided by a manufacturer is, rather sadly, the TB3 SSD.

Inside M4 chips: Controlling frequency

By: hoakley
2 December 2024 at 15:30

To realise best performance and energy efficiency from the big.LITTLE architecture in Apple’s M-series chips requires careful management on the part of macOS. There’s much more to it than balancing loads over conventional multi-core CPUs with a single type of core, as each execution thread needs to be run in an optimal location. When deciding where to run a CPU thread, macOS controls:

  • which type of core, P or E, primarily determined by the thread’s Quality of Service (QoS), and core availability;
  • which cluster to run it in, for chips with more than one cluster of that type, chosen to keep as few clusters active as possible;
  • which core within that cluster, determined by core availability, and semi-randomised to even out core use;
  • what frequency to run that cluster at, in turn depending on the core type and the thread’s QoS;
  • mobility of that thread between cores in the same cluster, and between clusters (when available).

Over the last four years, I have explored the rules apparently used for the first two, and the choice of frequency in E cores. This article looks in more detail at how the frequency of P clusters appears to be determined in M1, M3 and particularly M4 chips.

powermetrics provides more frequency figures than you know what to do with, although most are derived and to some extent imaginary, making reconciliation difficult. In tests reported in the previous article, I used those given as Cluster HW active frequency to demonstrate distinctive patterns seen on M4 P cores running different numbers of test threads.

[Image: m4frequenciesByThreads]

This graph shows those frequencies for the active P cluster by the number of threads, for floating point, NEON and vDSP_mmul tests detailed previously. Frequencies for the first two tests are identical, at P core maximum for a single thread, then falling sharply from 2 to 3 threads. When more threads are run, a cluster that’s fully active is run at the same frequency as that for 5 threads (P cluster size on this M4 Pro), while the other P cluster follows the same frequencies shown in the graph for the number of threads it’s running.

To examine this further I first climbed a mountain.

Climbing a mountain

For this test I used three copies of my test app to run a total of three identical threads of my in-core floating point test code in a mountain pattern. I first started powermetrics gathering data, then launched the first thread, followed by the second, and then the third. My objective was to observe an initial period when just one test thread was running, a second with the second test thread in addition, a third when all three threads would be running, and then watch the sequence reverse as each thread ended. This is shown in the results below.

[Image: m43appflopt1]

This chart shows active residencies by core and cluster for the P cores in an M4 Pro, with 5 P cores in each of its two P clusters, during this test. For the first 15 sample periods (1.5 seconds), a single test thread is moved around between cores in the second P cluster (P1). That’s joined by the second thread run on another core in the same cluster, until sample 30, when the third thread is added, pushing the total active residency to 300%.

At that point, all three threads are moved to three cores in the first P cluster (P0), whose bars are shown in blues and green. The first thread completes in sample 37, leaving two threads with 200% active residency to continue in that cluster until sample 50, when the second thread completes, leaving just one running. In sample 54 (5.4 seconds after test start), that one remaining thread is moved back to complete on core P11 in the second cluster late in sample 63.

In that period of 6.3 seconds, each of the two P clusters has run 1, 2 and 3 threads.

[Image: m43appflopt2]

This graph shows cluster frequencies over the same period, this time given in seconds elapsed rather than sample number. The red line and points show the frequency of cluster P1, and blue for P0. Those undergo step changes when each cluster is running test threads. The inactive cluster is normally shut down with a frequency of 0 MHz, although there are some brief spikes from that as well.

[Image: m43appflopt3]

Combining active residency bars in yellow with core frequency lines, it’s clear that cluster frequency is close to core maximum at 4,500 MHz when only a single thread is running. With two threads, it’s reduced to 4,400 MHz, and down to 3,900 MHz when all three threads are running. Those changes are symmetrical for loading and unloading clusters, and show no signs of hysteresis (different values during loading and unloading).

Closer examination gives frequencies of 4,511 MHz at 100% active residency, 4,415 MHz at 200%, and 3,924 MHz at 300%. The latter is 87% of maximum frequency, a large enough reduction to be reflected in performance. Essentially identical figures are found for NEON tests as well as these for floating point.

Although this test method can give highly reproducible results, the floating point and NEON tests used don’t resemble threads seen in everyday use. The next step is to extend that by looking at thread numbers and frequency when running more normal code.

Compressing a file

Fortunately, I have already built a suitable platform for real-world testing in a one-trick pony named Cormorant, a basic compression-decompression utility using Apple Archive. Although not a patch on serious apps like Keka, Cormorant can set the number and QoS of threads to be run during compression/decompression. Because it relies on Apple’s framework, it actually runs more than just the threads set in its controls, but still provides a way to control active residency. I therefore ran a test compression of a 15.5 GB IPSW image file at maximum QoS, to ensure its threads were dispatched to P cores, using 1-3 threads.

Time taken to compress the test file changes greatly according to the number of threads used:

  • 1 thread takes 49.6 s, at 313 MB/s;
  • 2 threads take 26.8 s, at 578 MB/s, 191% of the throughput of the single thread;
  • 3 threads take 18.7 s, at 829 MB/s, 265% of the single thread.

These appear to follow the pattern of frequencies observed on my in-core tests.

[Image: m4corm1thread1]

This chart shows the opening 3 seconds of single-thread compression, with cluster frequency in the points and line, and total active residency multiplied by 10 (to scale to a common y axis) in pale blue bars. Two significant periods are shown: in samples 12-21, active residency is high, between 300-430%, and frequency is lower at around 4,000 MHz. Following that, active residency falls to about 200% and frequency rises to 4,200 MHz.

Because active residency was so variable in these tests, I pooled paired values for that and cluster frequency, gathered over 3 second periods, and plotted those.

[Image: m4corm1thread2]

Although at active residencies below about 180% there’s a wide scatter, above that there’s a good linear regression, showing steady decline in frequency over active residencies ranging from 180% to 450%.

The following two graphs show equivalents for tests using 2 and 3 threads. The first of those has two outliers at total active residencies above 490%, corresponding to unusual conditions during the test. I have therefore excluded those from subsequent analyses.

[Image: m4corm2thread1]

[Image: m4corm3thread1]

The last step is to pool paired results from all three test conditions, and arrive at a line of best fit.

[Image: m4corm1-3Poolthread1]

Between total cluster residencies of 150-500%, this works best with a quadratic curve with the equation
F = 4630.05 - (2.0899 × R) + (0.0010107 × R²)
where F is the predicted frequency in MHz and R is the total active residency in %.
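
As a quick check on that fit, here’s a minimal Swift sketch evaluating it; it’s only meaningful over the 150-500% range used to derive the curve.

// Predicted P-cluster frequency (MHz) from total active residency R (%),
// using the quadratic fit above; valid for R of roughly 150-500%.
func predictedFrequency(atResidency R: Double) -> Double {
    return 4630.05 - 2.0899 * R + 0.0010107 * R * R
}

// For example, predictedFrequency(atResidency: 300) ≈ 4,094 MHz,
// and predictedFrequency(atResidency: 500) ≈ 3,838 MHz.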

My other real-world test makes use of the fact that, when virtualising macOS, the number of virtual cores on the host is specified.

Hosting virtual cores

Although virtualisation relies on frameworks run on the host, experience shows that its demand on host P cores is constrained to the number of virtual cores allocated, with each of those resulting in 100% active residency, equating to the whole of a P core on the host. powermetrics started collecting sample periods immediately before a macOS 14 VM was launched, and the first 3 seconds (30 samples) were collected and analysed for VMs with 1-3 virtual cores.
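
For reference, that virtual core count is set in the host’s Virtualization framework configuration. The following is a minimal hedged sketch: the memory size is a placeholder, and the platform, boot loader, storage and display settings that a working VM also needs are omitted.

import Virtualization

// Each virtual core produces roughly 100% active residency on one host P core
// while the VM is busy, so cpuCount bounds the VM's demand on the host.
let config = VZVirtualMachineConfiguration()
config.cpuCount = 3                              // 1-3 virtual cores were used in these tests
config.memorySize = 8 * 1024 * 1024 * 1024       // 8 GB, a placeholder value
// ... platform, boot loader, storage and display configuration omitted ...
do {
    try config.validate()
} catch {
    print("Configuration still incomplete:", error)
}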

[Image: m4vm1thread1]

This shows the first of those, a VM allocated just a single virtual core, with cluster frequencies shown as red and blue lines, and total active residency multiplied by 10 in the pale blue bars. With a steady total active residency of 100%, active cluster frequency was about 4,500 MHz. Note that sample 7 included transfer of the threads from P1 to P0 in a sharp peak to a total active residency of over 500%.

Average frequencies can thus be calculated for each of the three tests, at 100-300% active residencies.

Set frequencies

I now have estimates of cluster frequencies for cluster total active residencies from:

  • in-core tests using floating point
  • in-core tests using NEON
  • compression
  • virtualisation

against which I compare a matrix multiplication test that may be run on shared matrix co-processors (AMX). These are shown in the table below.

[Image: m4pclusterfreq]

Running a single thread in a cluster should result in a total active residency of 100%, for which macOS sets the cluster frequency at P core maximum, of 4,400-4,511 MHz. That for 200% is lower, at between 4,000-4,400 MHz, and falls off further to about 3,800 MHz when all 5 cores are at 100% active residency. Frequencies set for the vDSP_mmul test are significantly lower throughout, supporting the proposal that that test isn’t being run conventionally in P cores, but in a co-processor.

A sixth thread would then be loaded onto the other P cluster, where cluster frequency would be set at P core maximum again, progressively reducing with additional threads until that cluster was also running at about 3,800 MHz.

Following this, I returned to the tests I have performed over the last four years on M1 and M3 P cores. Although I haven’t analysed those formally, I now believe that their frequencies are controlled by macOS as follows:

  • M1 1 core at 3,228 MHz, 2 cores 3,132 MHz, 3-6 cores 3,036 MHz.
  • M3 1 core at 3,624 MHz (below maximum of 4,056), 2-6 cores 3,576 MHz.

The range of frequencies in the M1 and M3 is narrower, resulting in less difference in performance between single- and multi-core tests. However, the M4 falls to 87% maximum frequency at 3 threads and more, which is substantial. It’s worth noting that Geekbench single-core results for the M4 are around 3,892 and would scale up to a multi-core result of 38,920 on an M4 Pro with 10 P cores, whereas the actual multi-core score is about 22,700, 58% of the scaled value. Although the effects of lower frequency can’t account for all that difference, they must surely contribute to it.

Why?

Two plausible reasons for macOS reducing P cluster frequency with increasing active residency are thermal management, hence reliability, and competition for a limited shared resource, perhaps the L2 cache shared within each cluster.

The reductions in cluster frequency seen here aren’t thermal throttling, though. Tests were intentionally kept brief in order to accommodate their results in reasonably short series of powermetrics results. Power use was highest in the NEON and vDSP_mmul tests, and lowest in floating point, although there don’t appear to be matching differences in frequency control. As noted in the previous tests, High Power mode didn’t alter frequency control, although frequencies were reduced in Low Power mode.

It’s most likely that this frequency regulation is pre-emptive, based not just on the CPU cores themselves, but also allowing for likely heat output in the rest of the Mac.

Key information

  • When running on Apple silicon Macs, macOS modulates ‘cluster HW active frequency’ of P cores, limiting frequency to below maximum when cluster total active residency exceeds 100%.
  • Although small in M1 variants, this is most prominent in M4 variants, where a total active residency of 300% may reduce cluster frequency to 87% of maximum.
  • Frequency limitation is most probably part of a pre-emptive strategy in thermal management.
  • Frequency limitation is at least partly responsible for non-linear changes in performance with increasing recruitment of P cores, as illustrated in single- and multi-core benchmarks.
  • Control of P cores by macOS is complex, particularly in M4 variants.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Inside M4 chips: Matrix processing and Power Modes

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Acknowledgements

Several of you have contributed to discussions here, but Maynard Handley has for several years provided sage advice, challenging discussion, and his personal mine of information. Thank you all.

Inside M4 chips: Matrix processing and Power Modes

By: hoakley
27 November 2024 at 15:30

For much of the last four years, CPU and GPU core performance have been of primary importance to those using Apple silicon Macs. Despite that, special interest has developed in those cores that Apple doesn’t talk about, in its matrix co-processor, AMX. Since the first M1 it has been believed that each CPU core cluster has its own AMX, and more recently it has been shown to be capable of impressive performance. With the increasing prominence of computationally intensive features including AI, this is growing in importance.

When the M4 first became available in iPads earlier this year, researchers concluded its chip has a new AMX co-processor that can now be programmed using SME matrix extensions supported by the new version of the Arm instruction set in the M4, ARMv9.2-A. The best accounts of that challenging work are given on the Friedrich Schiller University’s site Hello SME, and Taras Zakharko’s GitHub.

From early work by Dougall Johnson on the M1, it has been known that some of the functions in Apple’s vast Accelerate maths libraries can run code on the AMX. Thanks to the guidance of Maynard Handley, a year ago I concluded that one of those is the vDSP_mmul function in the vDSP sub-library. This article reports tests of that function in a Mac mini M4 Pro running Sequoia 15.1.1, leads on to an explanation of previous results using floating point and NEON tests, and considers the effects of Power Modes.

Methods

I used the same methods as described previously. These consist of running many tight loops of test code designed to be confined as much as possible to register access. The test here calls vDSP_mmul to multiply two 16 x 16 32-bit floating point matrices, and consists of 15 x 10^6 such loops. Its source code is given in the Appendix at the end. During each test run, the command line tool powermetrics gathers core power and performance data in sampling periods of 0.1 second for a total of 50 samples. Its output is piped into a text file, which is then analysed using Numbers and DataGraph.

For comparison, I also show here my results from equivalent tests using assembly code for the NEON vector processor, as given in that previous article.

Power used by thread

The first graphs show average power use during each test with increasing numbers of threads, where error bars indicate a spread of +1 standard deviation.

[Image: m4powervdsp1]

The linear relationship for 1-10 threads, run on 1-10 P cores, is a better fit for vDSP_mmul, shown above, than NEON below. Run on P cores, each vDSP_mmul thread used about 3.6 W, significantly greater than NEON at 3.0 W. However, when those high QoS threads spilt over onto E cores, that relationship for vDSP_mmul broke down, leaving the highest power use that for 10 threads, one for each P core available in the M4 Pro chip used.

There’s no evidence of the steps seen below in NEON at 2-3 and 7-8 threads.

[Image: m4powerneon1]

Execution time

[Image: m4powervdsp2]

Total execution time has a strongly linear relationship too, for vDSP_mmul at about 2.6 seconds per thread. Again, that relationship broke down once test threads had spilt over onto E cores, unlike in the graph below for NEON.

[Image: m4powerneon2]

[Image: m4powervdsp3]

There are also obvious differences in the time required to execute each thread. vDSP_mmul (above) showed a linear increase from 1-5 threads, followed by a constant time for 5-10 threads. Once threads were running on E cores, the relationship became far looser, as seen in the red regression line. In NEON below, the relationship was weakest in the 1-5 thread range, and closer on the E cores from 11-14 threads.

[Image: m4powerneon3]

Energy use

[Image: m4powervdsp4]

When run on P cores, there’s a good line of fit indicating energy use of 9.2 J per thread (above), compared with NEON (below) at 7.7 J. The red line of best fit for the E core section from 11-14 threads suggests E core energy use of about 4.8 J per thread, again higher than NEON, which shows a much better fit and only about 3 J per thread.

[Image: m4powerneon4]

Maximum total energy use estimated for vDSP_mmul was just over 140 J, while for NEON it was only about 90 J.

The overall picture of vDSP_mmul is thus different from those seen in floating point and NEON tests. When run on P cores alone, vDSP_mmul behaves more linearly, using significantly more power and energy. Once running threads on E cores, though, that breaks down and performance falls, rather than simply slowing.

The role of frequency

There has long been a tacit assumption that, when running on P cores, computationally intensive threads such as those used in these tests are run at a fairly constant frequency close to maximum. Looking back at my earlier results on M1 and M3 cores, though, frequencies aren’t so consistent, and in many cases not that close to maximum either.

powermetrics provides more frequency figures than you know what to do with, although most are derived and to some extent imaginary, making reconciliation difficult. Taking the best estimate of core frequency as that given as Cluster HW active frequency, patterns seen on M4 P cores are distinctive.

[Image: m4frequenciesByThreads]

This graph shows those frequencies for the active P cluster by the number of threads, for floating point, NEON and vDSP_mmul tests. Frequencies for the first two are identical, at P core maximum for a single thread, then falling sharply from 2 to 3 threads. When more threads are run, the cluster that’s fully active is run at the same frequency as that for 5 threads (P cluster size on this M4 Pro), while the other P cluster follows the same frequencies shown in the graph for the number of threads it’s running.

Frequencies are controlled by macOS, and this suggests it adopts a standard pattern when running the two in-core tests, and a different one for vDSP_mmul presumably geared to performance of the AMX. Changes in frequency also account for the steps seen at 2-3 and 7-8 (= 5+2 and 5+3) threads in the NEON graphs above.

Power Modes

There has been considerable interest in the Power Mode setting available in macOS when running on M4 Pro and Max chips. To assess its effects I ran tests in 10 threads, to occupy the P cores fully, in each of the three Power modes.

There was no difference between results for the default Automatic and High Power modes, as expected. This is because the effect of High Power mode isn’t to change frequencies or power use in the short term, but to enable higher frequencies to be sustained for prolonged periods through more aggressive fan use, where in Automatic mode they would be throttled.

Low Power does have substantial effects on core frequency, performance and power use, though. When running floating point tests in 10 threads, their cluster frequency was reduced from 3,852 to 3,624 MHz, 94% of that in Automatic and High Power modes. That reduced power use from a mean of 13.9 W to 11.2 W, and increased the time taken to complete threads. Time taken by floating point threads increased to 106% of Automatic and High, while that for NEON increased to 135% and vDSP_mmul to 177%. While the reduced performance for floating point threads is unlikely to be noticeable, for vector and matrix threads it’s likely to be obvious to the user.

Key information

  • vDSP_mmul matrix multiplication from the vDSP sub-library in Accelerate behaves consistently with it being performed in the AMX co-processors in M4 Pro chips.
  • vDSP_mmul threads used significantly more power than NEON, reaching a maximum of just over 36 W when fully occupying all 10 P cores.
  • When spilt over to E cores, vDSP_mmul threads were much slower and their performance erratic, consistent with the E cluster having a smaller and significantly less performant AMX.
  • In-core tests (floating point and NEON) show common frequency regulation according to the number of cores active in each P core cluster. This runs a single thread at maximum frequency, then reduces sharply from 2 to 3 threads/cores. This accounts for the deviations from linearity observed in power use and performance. That pattern doesn’t appear in vDSP_mmul threads, though.
  • High Power and Automatic modes are identical in short-term tests.
  • Low Power mode reduces P cluster frequency and power use. Although its effects are unlikely to be noticeable in floating point threads, effects on vector and matrix threads are greater, and performance reductions are likely to be obvious to the user.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Finding and evaluating AMX co-processors in Apple silicon chips (M1 and M3)

Appendix: Source code

16 x 16 32-bit floating point matrix multiplication

// Run inside the test harness: assumes import Accelerate, with theReps (the loop count) supplied by the caller.
var theCount: Float = 0.0
let A = [Float](repeating: 1.234, count: 256)
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)
let IC: vDSP_Stride = 1
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
B.withUnsafeBufferPointer { Bptr in
C.withUnsafeMutableBufferPointer { Cptr in
for _ in 1...theReps {
vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB, Cptr.baseAddress!, IC, M, N, P)
theCount += 1
} } } }
return theCount

Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision.” “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”

Inside M4 chips: CPU power, energy and mystery

By: hoakley
25 November 2024 at 15:30

Few comparisons or benchmarks for M-series chips take into account the reason for equipping Apple silicon chips with more than one CPU core type, according to Arm’s big.LITTLE architecture. Measuring single- or multi-core performance ignores the purpose of E cores, and estimating overall power use can’t compare those core types. This article tries to estimate the cost in terms of power and energy of running identical tests on M4 P and E cores, and thereby provide insight into some of the most distinctive features of Apple silicon, and their benefits.

Methods

To run these two in-core performance tests I use a GUI app wrapped around a series of loading tests designed to enable the CPU core to execute that code as fast as possible, and with as few extraneous influences as possible. Both tests used here are written in assembly code, and aren’t intended to be purposeful in any way, nor to represent anything that real-world code might run. Those are:

  • 64-bit floating point arithmetic, including an FMADD instruction to multiply and add, and FSUB, FDIV and FADD for subtraction, division and addition;
  • 32-bit 4-lane dot-product vector arithmetic (NEON), including FMUL, two FADDP and a FADD instruction;

Source code of the loops is given in the Appendix.

The GUI app sets the number of loops to be performed, and the number of threads to be run. Each set of loops is then put into the same Grand Central Dispatch queue for execution, at the maximum Quality of Service (QoS) of 33. That ensures they are run preferentially on P cores, but will spill over to E cores when no P core is available, which happens when more than 10 threads are run concurrently. Timing of thread execution is performed using Mach Absolute Time, and the time for each thread to be executed is displayed at the end of the tests.
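
A stripped-down Swift sketch of that dispatch and timing scheme follows; the arithmetic loop merely stands in for the assembly routines in the Appendix, and the thread and loop counts are illustrative.

import Foundation

// Convert Mach Absolute Time ticks into seconds.
var timebase = mach_timebase_info_data_t()
mach_timebase_info(&timebase)
func elapsedSeconds(_ start: UInt64, _ end: UInt64) -> Double {
    Double(end - start) * Double(timebase.numer) / Double(timebase.denom) / 1e9
}

let threadCount = 12                                       // 10 fill the P cores, 2 spill onto E cores
let group = DispatchGroup()
let queue = DispatchQueue.global(qos: .userInteractive)    // the maximum QoS of 33

for n in 0..<threadCount {
    queue.async(group: group) {
        let start = mach_absolute_time()
        var x = 1.0
        for _ in 0..<100_000_000 { x = (x * 1.000001 + 1.0) / 1.000001 }  // stand-in for the test loop
        let end = mach_absolute_time()
        print("thread \(n): \(elapsedSeconds(start, end)) s, result \(x)")
    }
}
group.wait()                                               // report once every thread has completed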

For these tests, the total number of loops to be executed in each thread was set at 5 x 10^8 for floating point, and 3.5 x 10^9 for NEON. Those values were chosen to take 2-3 seconds per thread, to ensure the whole test period was available for analysis.

Immediately before running each test, I launch powermetrics from the command line, to gather core power and performance data in sampling periods of 0.1 second for a total of 50 samples. Its output is piped into a text file, which is then analysed using Numbers and DataGraph. All tests were conducted on a Mac mini M4 Pro with 10 P and 4 E cores, running macOS 15.1.1 in standard power mode.

Each test was inspected individually, and seen to contain the following phases:

  1. small initial activity resulting from bringing the GUI app into focus, and clicking the Run button;
  2. a brief period of low activity, typically with total CPU power at below 50 mW;
  3. 1-2 sample periods when threads are loaded onto the cores;
  4. 15-21 sample periods when threads are run, whose total CPU power measurements are collected for analysis;
  5. 1-2 sample periods when threads are unloaded;
  6. a return to low activity, typically with total CPU power returning below 50 mW.

Means and standard deviations were then calculated for each series of power measurements, and pooled with times taken to execute threads.

Power used by thread

The first pair of graphs shows average power use for the number of threads run, shown here with error bars giving the range of +1 standard deviation. These show two sections: for 1-10 threads, when all were running on P cores, and for 11-14 threads, when the 10 P cores were fully committed and 1-4 threads spilt over to run on E cores at their maximum frequency. Maximum power used during testing was just short of 34 W.

[Image: m4powerflopt1]

Those for the floating point test above, and NEON below, have regression lines fitted, indicating that:

  • Each additional floating point thread required 1,300 mW on P cores, and 110 mW on E cores.
  • Each additional NEON thread required 3,000 mW on P cores, and 280 mW on E cores.
  • P cores thus required 11-12 times the power of E cores, or E cores used 8-9% of the power of P cores.
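
Treating those fitted slopes as additive per-thread costs gives a rough model of the extra CPU power drawn by these test threads. This sketch ignores the regression intercepts, so it’s indicative only.

// Approximate additional CPU power (mW) for n test threads at maximum QoS,
// using the per-thread regression slopes above; P cores fill first, up to 10.
func estimatedPower(threads n: Int, neon: Bool) -> Double {
    let pThreads = Double(min(n, 10))
    let eThreads = Double(max(n - 10, 0))
    let perPCore = neon ? 3000.0 : 1300.0      // mW per thread on P cores
    let perECore = neon ? 280.0 : 110.0        // mW per thread on E cores
    return pThreads * perPCore + eThreads * perECore
}

// estimatedPower(threads: 14, neon: true) ≈ 31,100 mW, in the same region as the
// maximum of just under 34 W observed; the shortfall is baseline power that the slopes omit.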

[Image: m4powerneon1]

Although linear regressions aren’t a bad fit, there’s consistent deviation from the linear relationship seen in previous analyses on M1 and M3 cores. More remarkably, the pattern of deviation is identical between these two tests, although they run in different units in these cores. In both cases, power use was high for 2 and 7 threads, while that for 3 and 8 threads was slightly lower. The only unusual pattern seen in powermetrics output was that, when running 2 and 7 threads, thread mobility was much higher than in other tests.

Previous tests on M1 and M3 P cores found that each additional floating point thread run on those requires about 935 mW, indicating a substantial increase in power used by M4 P cores when running at their higher maximum frequency. E cores in an M1 Pro require about 100 mW each when running at maximum frequency, similar to those in the M4.

Execution time

As power is the rate of energy use over time, the next step is to examine total execution time for all the threads running concurrently, which should form a linear relationship with different gradients for P and E cores. The next two graphs demonstrate that.

[Image: m4powerflopt2]

For both floating point (above) and NEON (below), there’s a tight linear relationship between total execution time and numbers of threads. Floating point demonstrates that each thread costs 2.4 seconds on P cores and 3.6 seconds on E cores, making E core execution time 150% that of P cores. NEON is similar, at 2.5 seconds on P cores and 3.4 seconds on E cores, for a ratio of 136%.

[Image: m4powerneon2]

Time taken for the slowest thread to complete execution shows interesting finer detail.

[Image: m4powerflopt3]

For both tests, performance falls into several sections according to the number of threads run. With fewer than 5 threads run, there’s a sharp rise in time taken per thread. From 5-10 threads, time required remains constant, before increasing from 10-14 threads, when additional threads are spilt over onto E cores.

This has implications for anyone trying to measure core performance, as it demonstrates that a single thread can run disproportionately fast, compared with 3-10 threads. Basing any conclusion or comparison on a single thread completing in little more than 2 seconds, when 5 concurrent threads would take 2.34 seconds, 117% of the single thread, could be misleading.

[Image: m4powerneon3]

Energy use

Although power use determines heat production, and so is an important factor in determining cooling requirements, total energy required to execute threads is equally important for Macs running from battery. Simply reducing core frequency will reduce power used, but by extending the time taken to complete tasks, it may have no effect on energy used, or on battery endurance. My final two graphs therefore show estimated total energy used when running test threads on P and E cores, the ultimate test of any big.LITTLE CPU design such as that in the M4.

[Image: m4powerflopt4]

Graphs for floating point (above) and NEON (below) are inevitably similar in form to those for power, with a near-linear section from 1-10 threads, when the threads are run only on P cores, and from 11-14 threads when they also spill over to E cores.

Fitted regression lines provide the energy cost for each additional thread:

  • For floating point, each thread run on a P core costs 3.1 J, and for an E core 1.5 J, making the energy used by an E core 47% that of a P core.
  • For NEON, P cores cost 7.7 J per thread, and E cores 3.0 J, making the energy used by an E core 38% that of a P core.

It’s important to remember that the E cores here aren’t being run at frequencies for high efficiency, but at their maximum so they can substitute for the P cores that are already in use.

[Image: m4powerneon4]

Considering the small deviations from those linear relationships, it appears that running 2, 6 or 7 threads on P cores requires slightly more energy than predicted from the regression lines shown.

Unfortunately, assessing the energy used by E cores running at low frequencies, as they normally do when performing background tasks, is fraught with inaccuracies due to their low power use. My previous estimate for floating point tests is that a slow-running E core uses less than 45 mW per thread, and for the same task requires about 7% of the energy used by a P core running at maximum frequency, but I have lower confidence in the accuracy of those figures than in those above for higher frequencies.

Key information

  • When running the same code at maximum frequency, E cores used 8-9% of the power of P cores.
  • Power use when running 2 or 7 threads was anomalously high, possibly due to high thread mobility.
  • Execution on E cores was significantly slower than on P cores, at 136-150% of the time required on P cores.
  • Single-core performance measurements may not be accurate reflections of performance on multiple cores.
  • When running the same code at maximum frequency, energy used by an E core is expected to be 38-47% that of a P core.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance

Appendix: Source code

_fpfmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
FMOV D4, D0
FMOV D5, D1
FMOV D6, D2
LDR D7, INC_DOUBLE
fp_while_loop:
SUBS X4, X4, #1
B.EQ fp_while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
B fp_while_loop
fp_while_done:
FMOV D0, D4
LDR LR, [SP], #16
RET

_neondotprod:
STR LR, [SP, #-16]!
LDP Q2, Q3, [X0]
FADD V4.4S, V2.4S, V2.4S
MOV X4, X1
ADD X4, X4, #1
dp_while_loop:
SUBS X4, X4, #1
B.EQ dp_while_done
FMUL V1.4S, V2.4S, V3.4S
FADDP V0.4S, V1.4S, V1.4S
FADDP V0.4S, V0.4S, V0.4S
FADD V2.4S, V2.4S, V4.4S
B dp_while_loop
dp_while_done:
FMOV S0, S2
LDR LR, [SP], #16
RET

Inside M4 chips: CPU core performance

By: hoakley
20 November 2024 at 15:30

There’s no doubt that the CPUs in M4 chips outperform their predecessors. General-purpose benchmarks such as Geekbench demonstrate impressive rises in both single- and multi-core results, in my experience from 3,191 (M3 Pro) to 3,892 (M4 Pro), and 15,607 (M3 Pro) to 22,706 (M4 Pro). But the latter owes much to the increase in Performance (P) core count from 6 to 10. In this series I concentrate on much narrower concepts of performance in CPU cores, to provide deeper insight into topics such as core types and energy efficiency. This article examines the in-core performance of P and E cores, and how they differ.

P core frequencies have increased substantially since the M1. If we set that as 100%, M3 P cores run at around 112-126% of that frequency, and those in the M4 at 140%.

E cores are more complex, as they have at least two commonly used frequencies, that when running low Quality of Service (QoS) threads, and that when running high QoS threads that have spilt over from P cores. Low QoS threads are run at 77% of M1 frequency when on an M3, and 105% on an M4. High QoS threads are normally run at higher frequencies of 133% on the M3 E cores (relative to the M1 at 100%), but only 126% on the M4.

Methods

To measure in-core performance I use a GUI app wrapped around a series of loading tests designed to enable the CPU core to execute that code as fast as possible, and with as few extraneous influences as possible. Of the seven tests reported here, three are written in assembly code, and the others call optimised functions in Apple’s Accelerate library from a minimal Swift wrapper. These tests aren’t intended to be purposeful in any way, nor to represent anything that real-world code might run, but simply provide the core with the opportunity to demonstrate how fast it can be run at a given frequency.

The seven tests used here are:

  • 64-bit integer arithmetic, including a MADD instruction to multiply and add, a SUBS to subtract, an SDIV to divide, and an ADD;
  • 64-bit floating point arithmetic, including an FMADD instruction to multiply and add, and FSUB, FDIV and FADD for subtraction, division and addition;
  • 32-bit 4-lane dot-product vector arithmetic (NEON), including FMUL, two FADDP and a FADD instruction;
  • simd_float4 calculation of the dot-product using simd_dot in the Accelerate library.
  • vDSP_mmul, a function from the vDSP sub-library in Accelerate, multiplies two 16 x 16 32-bit floating point matrices, which in M1 and M3 chips appears to use the AMX co-processor;
  • SparseMultiply, a function from Accelerate’s Sparse Solvers, multiplies a sparse and a dense matrix, that may use the AMX co-processor in M1 and M3 chips.
  • BNNSMatMul matrix multiplication of 32-bit floating-point numbers, here in the Accelerate library, and since deprecated.

Source code of the loops is given in the Appendix.

The GUI app sets the number of loops to be performed, and the number of threads to be run. Each set of loops is then put into the same Grand Central Dispatch queue for execution, at a set Quality of Service (QoS). Timing of thread execution is performed using Mach Absolute Time, and the time for each thread to be executed is displayed at the end of the tests.

I normally run tests at either the minimum QoS of 9, or the maximum of 33. The former are constrained by macOS to be run only on E cores, while the latter are run preferentially on P cores, but may spill over to E cores when no P core is available. All tests are run with a minimum of other activities on that Mac, although it’s not unusual to see small amounts of background activity on the E cores during test runs.

The number of loops completed per second is calculated for two thread totals for each of the three execution contexts. Those are:

  • P cores alone, based on threads run at high QoS on 1 and 10 P cores;
  • E cores at high frequency (‘fast’), run at high QoS on 10 cores (no threads on E cores) and 14 (4 threads on E cores);
  • E cores at low frequency (‘slow’), run at low QoS on 1 and 4 E cores.

Results are then corrected by removing overhead estimated as the rate of running empty loops. Finally, each test is expressed as a percentage of the performance achieved by the P cores in an M1 chip. Thus, a loop rate double that achieved by running the same test on an M1 P core is given as 200%.
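
That normalisation amounts to the following small calculation, sketched here in Swift with every input a measured value rather than anything set by the test app.

// Corrected loop rate expressed as a percentage of the M1 P core baseline.
func relativePerformance(loops: Double, seconds: Double,
                         emptyLoopRate: Double, m1PCoreRate: Double) -> Double {
    let corrected = loops / seconds - emptyLoopRate    // loops per second, net of overhead
    return 100 * corrected / m1PCoreRate               // 200% means twice the M1 P core rate
}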

P core performance

As in subsequent sections, these are shown in the bar chart below, in which the pale blue bars are for M1 P cores, dark blue bars for M3 P cores, and red for M4 P cores.

[Image: m134coreperf1]

As I indicated in my preview of in-core performance, there is little difference in integer performance between M3 and M4 P cores, but a significant increase in floating point, which matches that expected from the increased frequency of the M4.

Vector performance in the NEON and simd dot tests, and matrix multiplication in vDSP mmul rise higher than would be expected by frequency differences alone, to over 160% of M1 performance. The latter two tests are executed using Accelerate library calls, so there’s no guarantee that they are executed the same on different chips, but the first of those is assembly code using NEON instructions. SparseMultiply and BNNS matmul are also Accelerate functions whose execution may differ, and don’t fare quite as well on the M4.

E core slow performance

[Image: m134coreperf2]

On E cores, threads run at low QoS are universally at frequencies close to idle, as reflected in their performance, still relative to an M1 P core at much higher frequency. Frequency differences account for the relatively poor performance of M3 E cores, and improvement in results for the M4. Those are disproportionate in vector and matrix tests, which could be accounted for by the M4 E core running those at higher frequencies than would be normal for low QoS threads.

Although the best of these, NEON, is still well below M1 P core performance (73%), this suggests a design decision to deliver faster vector processing on M4 E cores, which is interesting.

E core fast performance

[Image: m134coreperf3]

Ideally, when high QoS threads overspill from P cores, it’s preferable that they’re executed as fast as they would have been on a P core. Those in the M1 fall far short of that, in scalar and vector tests only delivering 40-60% of a P core, although that seemed impressive at the time. The M3 does considerably better, with vector and one matrix test slightly exceeding the M1 P core, and the M4 is even faster in vector calculations, peaking at over 130% for NEON assembly code.

Far from being a cut-down version of its P core, the M4 E core can now deliver impressive vector performance when run up to maximum frequency.

M4 P and E comparison

Having considered how P and E cores have improved against those in the M1, it’s important to look at the range of computing capacity they provide in the M4. This is shown in the chart below, where pale blue bars are P cores, red bars E cores at high QoS and frequency, and dark blue bars E cores at low QoS and frequency. Again, these are all shown relative to P core performance in the M1.

[Image: m134coreperf4]

Apart from the integer test, scalar floating point, vector and matrix calculations on P cores range between 140-175% those of the M1, a significant increase on that expected from the frequency increase alone. Scalar and vector (but not matrix) calculations on E cores at high frequency are slower, although in most situations that shouldn’t be too noticeable. Performance does drop off for E cores at low frequency, though, and that would clearly have an impact on code run there.

Given the range of operating frequencies, P and E cores in the M4 chip deliver a wide range of performance at different power levels, and it’s power that I’ll examine in the next article in this series.

Key information

  • M4 P core maximum frequency is 140% that of the M1. That increase in frequency accounts for much of the improved P core performance seen in M4 chips.
  • E core frequency changes are more complex, and some have reduced rather than risen compared with the M1.
  • P core floating point performance in the M4 has increased as would be expected by frequency change, and vector and matrix performance has increased more, to over 160% those of the M1.
  • E core performance at low QoS and frequency has improved in comparison to the M3, and most markedly in vector and matrix tests, suggesting design improvements in the latter.
  • E core performance at high QoS and frequency has also improved, again most prominently in vector tests.
  • Across their frequency ranges, M4 P and E cores now deliver a wide range of performance and power use.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores

Appendix: Source code

_intmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
int_while_loop:
SUBS X4, X4, #1
B.EQ int_while_done
MADD X0, X1, X2, X3
SUBS X0, X0, X3
SDIV X1, X0, X2
ADD X1, X1, #1
B int_while_loop
int_while_done:
MOV X0, X1
LDR LR, [SP], #16
RET

_fpfmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
FMOV D4, D0
FMOV D5, D1
FMOV D6, D2
LDR D7, INC_DOUBLE
fp_while_loop:
SUBS X4, X4, #1
B.EQ fp_while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
B fp_while_loop
fp_while_done:
FMOV D0, D4
LDR LR, [SP], #16
RET

_neondotprod:
STR LR, [SP, #-16]!
LDP Q2, Q3, [X0]
FADD V4.4S, V2.4S, V2.4S
MOV X4, X1
ADD X4, X4, #1
dp_while_loop:
SUBS X4, X4, #1
B.EQ dp_while_done
FMUL V1.4S, V2.4S, V3.4S
FADDP V0.4S, V1.4S, V1.4S
FADDP V0.4S, V0.4S, V0.4S
FADD V2.4S, V2.4S, V4.4S
B dp_while_loop
dp_while_done:
FMOV S0, S2
LDR LR, [SP], #16
RET

func runAccTest(theA: Float, theB: Float, theReps: Int) -> Float {
var tempA: Float = theA
var vA = simd_float4(theA, theA, theA, theA)
let vB = simd_float4(theB, theB, theB, theB)
let vC = vA + vA
for _ in 1...theReps {
tempA += simd_dot(vA, vB)
vA = vA + vC
}
return tempA
}

16 x 16 32-bit floating point matrix multiplication

// Run inside the test harness: assumes import Accelerate, with theReps (the loop count) supplied by the caller.
var theCount: Float = 0.0
let A = [Float](repeating: 1.234, count: 256)
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)
let IC: vDSP_Stride = 1
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
B.withUnsafeBufferPointer { Bptr in
C.withUnsafeMutableBufferPointer { Cptr in
for _ in 1...theReps {
vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB, Cptr.baseAddress!, IC, M, N, P)
theCount += 1
} } } }
return theCount

Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision.” “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”

Sparse matrix multiplication

var theCount: Float = 0.0
let rowCount = Int32(4)
let columnCount = Int32(4)
let blockCount = 4
let blockSize = UInt8(1)
let rowIndices: [Int32] = [0, 3, 0, 3]
let columnIndices: [Int32] = [0, 0, 3, 3]
let data: [Float] = [1.0, 4.0, 13.0, 16.0]
let A = SparseConvertFromCoordinate(rowCount, columnCount, blockCount, blockSize, SparseAttributes_t(), rowIndices, columnIndices, data)
defer { SparseCleanup(A) }
var xValues: [Float] = [10.0, -1.0, -1.0, 10.0, 100.0, -1.0, -1.0, 100.0]
let yValues = [Float](unsafeUninitializedCapacity: xValues.count) {
resultBuffer, count in
xValues.withUnsafeMutableBufferPointer { denseMatrixPtr in
let X = DenseMatrix_Float(rowCount: 4, columnCount: 2, columnStride: 4, attributes: SparseAttributes_t(), data: denseMatrixPtr.baseAddress!)
let Y = DenseMatrix_Float(rowCount: 4, columnCount: 2, columnStride: 4, attributes: SparseAttributes_t(), data: resultBuffer.baseAddress!)
for _ in 1...theReps {
SparseMultiply(A, X, Y)
theCount += 1
} }
count = xValues.count
}
return theCount

Apple describes SparseMultiply() as performing “the multiply operation Y = AX on a sparse matrix of single-precision, floating-point values.” “Use this function to multiply a sparse matrix by a dense matrix.”

BNNS matrix multiplication

var theCount: Float = 0.0
let inputAValues: [Float] = [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 ]
let inputBValues: [Float] = [
1, 2,
3, 4,
1, 2,
3, 4,
1, 2,
3, 4]
var inputADescriptor = BNNSNDArrayDescriptor.allocate(
initializingFrom: inputAValues,
shape: .imageCHW(3, 4, 2))
var inputBDescriptor = BNNSNDArrayDescriptor.allocate(
initializingFrom: inputBValues,
shape: .tensor3DFirstMajor(3, 2, 2))
var outputDescriptor = BNNSNDArrayDescriptor.allocateUninitialized(
scalarType: Float.self,
shape: .imageCHW(inputADescriptor.shape.size.0,
inputADescriptor.shape.size.1,
inputBDescriptor.shape.size.1))
for _ in 1...theReps {
BNNSMatMul(false, false, 1,
&inputADescriptor, &inputBDescriptor, &outputDescriptor, nil, nil)
theCount += 1
}
inputADescriptor.deallocate()
inputBDescriptor.deallocate()
outputDescriptor.deallocate()
return theCount

Inside M4 chips: E and P cores

By: hoakley
18 November 2024 at 15:30

In the two previous articles (links at the end), I explored some of the features and properties of Performance (P) cores in Apple’s latest M4 chips. This article looks at their Efficiency (E) cores by comparison.

M4 family

In the three current M4 designs, there are only two variations in terms of E cores:

  • Base M4, with 6 E cores, except for a cheaper variant with only 4 active E cores.
  • M4 Pro and Max, with 4 E cores, including ‘binned’ variants.

Apple is expected to release an Ultra variant in 2025, with two M4 Max chips in tandem, providing a total of 8 E cores. Apart from the number of cores, all E cores are the same, and different from P cores.

E core architecture

All E cores are arranged in a single cluster of 4 or 6, sharing common L2 cache, and running at the same frequency (clock speed). Analysis of M1 cores implies that each E core has roughly half the number of processing units of a P core (for those units where the P core has more than one), giving an M1 E core roughly half the compute capacity of a P core. I haven’t seen any comparable analysis of cores in later M families, although differences in power consumption imply there remain substantial differences in processing units and compute capacity.

Frequency

Like P cores, E cores can be set to run at any of 5 values between the minimum of 1,020 MHz and maximum of 2,592 MHz (1.0-2.6 GHz). When running macOS, cluster frequency is set by macOS at a kernel level; other operating systems may offer more direct control. This range of frequencies is significantly narrower than that of E cores in the M3, which range between 744-2,748 MHz.

E cores idle at 1,020 MHz, and although they can be shut down altogether, that’s exceptional given the steady demand for macOS background threads to be run on them. Nevertheless, powermetrics still reports their ‘down’ residencies separately from idle residencies.

Instruction set

This is believed to be identical to that of M4 P cores, ARMv9.2-A without its Scalable Vector Extension (SVE), enabling the same threads to be run on either core type.

Single thread comparisons

One way to appreciate the contrasts between core types is to compare a single intensive in-core thread run in each. For this purpose, I used a tight loop of floating point calculations, running at two different Quality of Service (QoS) settings, in macOS 15.1.
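For concreteness, here is a minimal Swift sketch of how such a test thread can be launched at a chosen QoS using Dispatch; fpLoop() is a trivial stand-in for the assembly test in the appendix, not the exact harness used for these measurements.

import Dispatch

// A minimal sketch: the same intensive loop dispatched at two different QoS settings.
func fpLoop(_ reps: Int) -> Double {
    var x = 1.2345
    for _ in 0..<reps { x = (x * 1.5 + 2.0) / 1.5 - (2.0 / 1.5) }   // busy floating point work
    return x
}

let group = DispatchGroup()
DispatchQueue.global(qos: .userInitiated).async(group: group) { _ = fpLoop(1_000_000_000) }   // eligible for a P core
DispatchQueue.global(qos: .background).async(group: group) { _ = fpLoop(1_000_000_000) }      // QoS 9, confined to the E cluster
group.wait()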

Single thread at high QoS on P cores

m4singlePflopt1

This thread was initially loaded onto P13 (red) in the second (P1) cluster, and after 3.7 seconds was moved to P5 (blue) in the first (P0) cluster. After a further 4.6 seconds running on that, it was moved back to the second (P1) cluster, to run on P11 (purple). During this run, there was almost no other activity on the two P clusters, and the inactive cluster was therefore shut down while this thread was running on the other.

m4singlePflopt2

The active cluster was run at the maximum frequency of 4,511 MHz throughout. Just before the thread was moved to a different cluster, that cluster was brought up and run up to maximum frequency, ready to receive the thread.

m4singlePflopt3

Total CPU power remained similar throughout the period the thread was being executed, but there is a small and consistent difference according to which cluster was active: the first (P0) brought total power use to about 2,520 mW, 50 mW higher than the second (P1) at about 2,470 mW. This matches the difference reported previously, and merits assessment in other M4 Pro chips to determine whether this is a general feature.

Single thread at high QoS on E cores

There are two ways of running code such as the in-core floating point loop test used here on E cores: it can be run at low QoS (Background), so that macOS allocates it to run only on E cores, or it can be spilt over from high QoS threads when there are more of those than available P cores. On an M4 Pro chip, the latter requires 11 threads, which results in one of them being allocated to the E cluster, as described next.
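As a rough illustration of that spill-over route, this minimal Swift sketch queues 11 concurrent high QoS work items at once, so on a 10 P-core M4 Pro one of them should end up on the E cluster; busyWork() is again only a stand-in for the appendix code.

import Dispatch

func busyWork() {
    var x = 1.2345
    for _ in 0..<1_000_000_000 { x = (x * 1.5 + 2.0) / 1.5 - (2.0 / 1.5) }
    _ = x
}

let group = DispatchGroup()
// 11 high QoS work items on 10 P cores: one spills over onto the E cluster.
for _ in 0..<11 {
    DispatchQueue.global(qos: .userInitiated).async(group: group) { busyWork() }
}
group.wait()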

m4singlePonEflop1

This chart shows active residency on the four E cores with a single high QoS thread spilt onto them. While cores E1, E2 and E3 appear to handle other threads over this period of more than six seconds, core E0 appears to run at 90-100% active residency executing the spilt thread, which wasn’t moved between cores at any point in that period.

E cluster frequency remained constant throughout at its maximum of 2,592 MHz. CPU power use was inevitably dominated by the ten P cores running at 100% active residency and maximum frequency, remaining at just under 14,000 mW. Unfortunately, using powermetrics it’s not possible to estimate the power use of the E cluster directly.

Single thread at low QoS on E cores

This is very different from the spilt thread at high QoS.

m4singleEflop1

There’s no evidence here that any single core in the E cluster ran a thread at 100% active residency. Instead the thread appears to have been moved rapidly and freely around the cores, with many 0.1 second sampling intervals spanning its execution in more than one core over that period.

m4singleEflop2

Cluster frequency was steady at 1,050-1,060 MHz, close to the minimum, with superimposed spikes when it rose briefly to the maximum of 2,592 MHz. This suggests that the single thread would most probably have been run at close to minimum core frequency, had there not been additional threads to run.

m4singleEflop3

A similar picture is seen in power use, with spikes from a low background of about 40-45 mW required by the single thread alone.

Single thread behaviours

These can be summarised as:

  • P core (high QoS) runs at 100% active residency on a single P core at maximum frequency, and is switched between clusters irregularly (about every 3.7-4.6 seconds). Total power use is about 2,500 mW.
  • High QoS spilt over to E cores runs at 90-100% active residency on a single E core at maximum frequency, and is either not switched between cores at all, or only infrequently.
  • E core (low QoS) runs at about 100% and is moved frequently between all E cores in the cluster, at close to minimum frequency. Total power use is about 40-45 mW.

Performance, power and efficiency

Although I’ll be returning to more detailed comparisons of performance and power use between P and E cores, I provide a single illustration here, for the in-core floating point task used above.

Running 2 x 10^9 loops in each thread, P cores at maximum frequency take 9.2-9.7 seconds per thread, and use about 2,500 mW per thread. E cores running low QoS threads at close to minimum frequency take about four times as long, 38.5 seconds, but use less than 45 mW power per thread. Total energy used to complete one thread is therefore over 23 J when run on P cores, and less than 1.7 J when run on E cores. E cores therefore use only 7% of the energy that P cores do performing the same task.
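As those energy figures are simply power multiplied by time, the arithmetic can be checked directly; the values below are mid-range assumptions taken from the measurements above.

// A quick check of the energy arithmetic: energy (J) = power (W) × time (s)
let pCoreEnergy = 2.5 * 9.45      // ≈ 23.6 J per thread on a P core at maximum frequency
let eCoreEnergy = 0.043 * 38.5    // ≈ 1.66 J per thread on an E core at low QoS
let ratio = eCoreEnergy / pCoreEnergy   // ≈ 0.07, about 7% of the P core energy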

Key information

  • Current M4 chips feature 4-6 CPU E cores.
  • M4 E cores are arranged in a single cluster of 4 or 6, sharing L2 cache and running at a common frequency.
  • The E core cluster can be shut down (exceptionally), left idling at its minimum frequency of 1,020 MHz, or run at one of 6 set frequencies up to a maximum of 2,592 MHz, as controlled by macOS.
  • Their instruction set is the same as M4 P cores, ARMv9.2-A without its Scalable Vector Extension (SVE).
  • They use 40-45 mW when at low frequencies, but it’s not currently feasible to measure directly their maximum power use at high frequencies.
  • macOS allocates threads to E cores when their QoS is 9 (Background), and when a thread with higher QoS can’t be allocated to a P core because they are all busy. Management of frequencies and core allocation differ between those two cases.
  • High QoS threads on E cores are run at maximum frequency and appear not to move between cores.
  • Low QoS threads on E cores are run at close to minimum frequency and are highly mobile between cores.
  • Low QoS threads running on E cores run more slowly than higher QoS threads running on P cores, but E core power use is much lower, resulting in considerable saving in total energy use for the same computational task.

Previous article

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

M4 Macs can’t virtualise older macOS

By: hoakley
14 November 2024 at 15:30

If you’ve already got a new M4 Mac and tried to run a macOS virtual machine on it, then you might have been disappointed. It seems that M4 chips can’t virtualise any version of macOS before 13.4 Ventura. So before you trade in or pass on your M1, M2 or M3 Mac, if you need access to older VMs, you might like to check whether this affects you.

I’m indebted to Csaba Fitzl for drawing my attention to this problem, and for reporting it to Apple in Feedback FB15774587. It has also been reported as affecting UTM, and I believe affects all other macOS virtualisation software for Apple silicon.

The bug

Running a macOS VM for any version before 13.4 Ventura on an M4 Mac results in a black screen, and the VM fails to boot. This is true whatever settings are used in the virtualiser, even if it’s set to boot the VM in Recovery mode. It’s also true when that VM has been built on that Mac: although that appears to complete successfully, when first run that VM opens as a black screen and never proceeds with personalisation and setup.

Currently the only way to run a VM with macOS prior to 13.4 Ventura is to do so on a host with an M1, M2 or M3 chip.

Can this be fixed?

Unfortunately, as this bug prevents the VM from booting, there’s no reliable way to access its log to discover what’s going wrong there. There’s no sign of the failure in the host’s log either: the host appears to initialise its Virtio and other support normally, without errors or faults. After those, virtualisation processes on the host fall silent as they wait for the VM to start, which never happens.

There is a useful clue in Activity Monitor, though: in its CPU pane, although the VM has been allocated multiple virtual cores, only one is seen to be active on the host. That implies the failure is occurring before the VM kernel boots the other cores, an event that occurs early during the kernel boot phase. Until that point, pre-boot phases and the kernel run on just a single CPU core.

macOS 13.4 updated iBoot to version 8422.121.1, so at first sight the VM could be failing when running older firmware. That doesn’t appear likely, as version 8422.121.1 was also installed in the 12.6.6 security update, so 12.7.x shouldn’t suffer this problem, but it does.

It thus appears most likely that this bug strikes in the early part of kernel boot, in which case the most feasible solution would be to fix the bug in macOS kernels prior to 13.4, and promulgate new IPSW image files for those. I suspect that’s very unlikely to happen, and as far as I’m aware it would be the first time that Apple has issued revised IPSWs.

Which macOS can you virtualise?

Support for lightweight virtualisation of macOS on Apple silicon Macs was still in progress in the first version of macOS to run on M1 chips, Big Sur. You therefore cannot create or run Big Sur VMs.

macos12

The first versions that can run in VMs are those of macOS 12 Monterey, although prior to 12.4 they can sometimes be a bit fractious. They also have major limitations, such as not supporting folders shared with the host.

macos121

This is 12.1 with its old System Preferences, running happily on an M3 Pro.

macos13

macOS Ventura should run well on M1, M2 and M3 hosts, and 13.4 and later on M4 hosts too.

macos14

macOS Sonoma should run even better on all current Apple silicon Macs, and delivers a much improved display with autoscaling. However, 14.2 and 14.2.1 don’t automatically support shared folders because of a bug that was fixed in 14.3.

macos15

macOS Sequoia is fully compatible, and adds support for iCloud Drive and some other Apple Account features, although still won’t run App Store apps. It may also fail to install the required extras to support Apple Intelligence.

Summary

  • Currently, M4 Macs can only run VMs of macOS Ventura 13.4 and later, 14 Sonoma, and 15 Sequoia.
  • M1, M2 and M3 Macs can run VMs of macOS Monterey 12.0.1 and later.
  • macOS Big Sur 11 can’t be virtualised on Apple silicon.

Further information

Viable and virtualisation
Why macOS 14.2 & 14.2.1 VMs lose shared folders, and how to work around it

Postscript

I’m grateful for the suggestion that it might be possible to work around this problem by running the older VM on a single core. Although I did manage to do that on an M3 (the first time I have seen recent macOS running on just one core!), that fails just the same on an M4, I’m afraid.

Inside M4 chips: P cores hosting a VM

By: hoakley
13 November 2024 at 15:30

One common but atypical situation for any M-series chip is running a macOS virtual machine. This article explores how virtual CPU cores are handled on physical cores of an M4 Pro host, and provides further insight into their management, and thread mobility across P core clusters.

Unless otherwise stated, all results here are obtained from a macOS 15.1 Sequoia VM in my free virtualiser Viable, with that VM allocated 5 virtual cores and 16 GB of memory, on a Mac mini M4 Pro running macOS Sequoia 15.1 with 48 GB of memory, 10 P and 4 E cores.

How virtual cores are allocated

All virtualised threads are treated by the host as if they are running at high Quality of Service (QoS), so are preferentially allocated to P cores, even though their original thread may be running at the lowest QoS. This has the side-effect of running virtual background processes considerably quicker than real background threads on the host.

In this case, the VM was given 5 virtual cores so they could all run in a single P cluster on the host. That doesn’t assign 5 physical cores to the VM, but runs VM threads up to a total of 500% active residency across all the P cores in the host. If the VM is assigned more virtual cores than are available in the host’s P cores, then some of its threads will spill over and be run on host E cores, but at the high frequency typical of host threads with high QoS.
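For orientation, the sketch below shows roughly how a virtualiser built on Apple’s Virtualization framework sets those virtual core and memory figures. It covers only the two properties relevant here, with the rest of the configuration (boot loader, storage, display and other Virtio devices) omitted, so treat it as an outline rather than Viable’s actual code.

import Virtualization

let config = VZVirtualMachineConfiguration()
config.cpuCount = 5                          // five virtual cores, fitting within one P cluster of an M4 Pro
config.memorySize = 16 * 1024 * 1024 * 1024  // 16 GB
// Clamp to the limits the host supports before validating the configuration.
config.cpuCount = min(max(config.cpuCount, VZVirtualMachineConfiguration.minimumAllowedCPUCount),
                      VZVirtualMachineConfiguration.maximumAllowedCPUCount)
do {
    try config.validate()                    // the boot loader, storage and devices must be added first
} catch {
    print("VM configuration is invalid: \(error)")
}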

Performance

There is a slight performance hit in virtualisation, but it’s surprisingly small compared to other virtualisers. Geekbench 6.3.0 benchmarks for guest and host were:

  • CPU single core VM 3,643, host 3,892
  • CPU multi-core VM 12,454, host 22,706
  • GPU Metal VM 102,282, host 110,960, with the VM as an Apple Paravirtual device.

Some tests are even closer: using my in-core floating point test, 1,000 Mloops run in the VM in 4.7 seconds, and in the host in 4.68 seconds.

Host core allocation

To assess P core allocation on the host, an in-core floating point test was run in the VM. This consisted of 5 threads with sufficient loops to fully occupy the virtual cores for about 20 seconds. In the following charts, I show results from just the first 15 seconds, as representative of the whole.

vmpcoresm4pro0

When viewed by cluster, those threads were mainly loaded first onto the first P cluster (pale blue bars), where they ran for just over 1 second before being moved to the second cluster (red bars). They were then regularly switched between the two P clusters every few seconds throughout the test. Four cycles were completed in this section of the results, with each taking 2.825 seconds, so threads were switched between clusters every 1.4 seconds, the same time as I found when running threads on the host alone, as reported previously.

For most of the 15 seconds shown here, total active residency across both P clusters was pegged at 500%, as allocated to the VM in its 5 virtual cores, with small bursts exceeding that. Thus that 500% represents those virtual cores, and the small bursts are threads from the host. Although the great majority of that 500% was run on the active P cluster, a total of about 30% active residency consisted of other threads from the VM, and ran on the less active P cluster. That probably represents the VM’s macOS background processes and overhead from its folder sharing, networking, and other Virtio device use.

vmpcoresm4pro1

When broken down to individual cores within each cluster, with the first cluster shown in the chart above and the second below, total activity differs little across the cores in the active cluster. During its period in the active cluster, each core had an active residency of 80-100%, bringing the cluster total to about 450% while most active.

vmpcoresm4pro2

In case you’re wondering whether this occurs on older Apple silicon, and it’s just a feature of macOS Sequoia, here’s a similar example of a 4-core VM running 3 floating point threads in an M1 Max with macOS 15.1, seen in Activity Monitor’s CPU History window. There’s no movement of threads between clusters.

vmm1maxtest1

P core frequencies

powermetrics, used to obtain this data, provides two types of core frequency information. For each cluster it gives a hardware active frequency, then for each core it gives an individual frequency, which often differs within each cluster. Cores in the active P cluster were typically reported as running at a frequency of 4,512 MHz, although the cluster frequency was lower, at about 3,858 MHz. For simplicity, cluster frequencies are used here.

vmpcoresm4pro3

This chart shows reported frequencies for the two P clusters in the upper lines. Below them are total cluster active residencies to show which cluster was active during each period.

The active cluster had a steady frequency of just below 3,900 MHz, but when it became the less active one, its frequency varied greatly, from idle at 1,260 MHz up to almost 4,400 MHz, often for very brief periods. This is consistent with the active cluster running the intensive in-core test threads, and the other cluster handling other threads from both the VM and host.

CPU power

Several who have run VMs on notebooks report that they appear to drain the battery quickly. Using the previous results from the host, the floating point test used here would be expected to use a steady 7,000 mW.

vmpcoresm4pro4

This last chart shows the total CPU power use in mW over the same period, again with cluster active residency (here multiplied by 10), added to aid recognition of cluster cycles. This appears to average about 7,500 mW, only 500 mW more than expected when run on the host alone. That shouldn’t result in a noticeable increase in power usage in a notebook.

In the previous article, I remarked on how power used appeared to differ between the two clusters, and this is also reflected in these results. When the second cluster (P1) is active, power use is less, at about 7,100 mW, and it’s higher at about 7,700 when the first cluster (P0) is active. This needs to be confirmed on other M4 Pro chips before it can be interpreted.

Key information

  • macOS guests perform almost as well as the M4 Pro host, although multi-core benchmarks are proportionate to the number of virtual cores allocated to them. In particular, Metal GPU performance is excellent.
  • All threads in a VM are run as if at high QoS, thus preferentially on host P cores. This accelerates low QoS background threads running in the VM.
  • Virtual core allocation includes all overhead from the VM, such as its macOS background threads.
  • Guest threads are as mobile as those of the host, and are moved between P clusters every 1.4 seconds.
  • Although threads run in a VM incur a small penalty in additional power use, this shouldn’t be significant for most purposes.
  • Once again, evidence suggests that the first P cluster (P0) in an M4 Pro uses slightly more power than the second (P1). This needs to be confirmed in other systems.
  • powermetrics can’t be used in a VM, unsurprisingly.

Previous article

Inside M4 chips: P cores

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Inside M4 chips: P cores

By: hoakley
11 November 2024 at 15:30

This is the first in a series diving deeper into Apple’s new M4 family of chips. This starts with details of its Performance (P) cores. Comparisons of their performance against cores in earlier M-series chips will follow separately when I have completed them.

M4 family

There are currently three M4 designs:

  • Base M4, with 4 P and 6 E cores, also available in a cheaper variant with only 4 active E cores, and a ‘binned’ variant for iPads with only 3 active P cores.
  • M4 Pro, with 10 P and 4 E cores, also available in a ‘binned’ variant with only 8 active P cores.
  • M4 Max, with 12 P and 4 E cores, also available in a ‘binned’ variant with only 10 active P cores.

Apple is expected to release an Ultra variant in 2025, consisting of two M4 Max chips connected and working in tandem, providing a total of 24 P and 8 E cores.

Apart from the number of cores in each design, their caches and memory, all P cores are the same, and different from E cores.

P core architecture

All CPU cores are arranged in clusters of up to 6. All cores within any given cluster share L2 cache, and are run at the same frequency (clock speed). The Base M4 has a single cluster of 4 P cores, while the Pro and Max have two clusters of 5 and 6 cores respectively.

Frequency

A prominent feature of both P and E cores is their variable frequency (clock speed). In the case of P cores, this can be set to any of 17 values between the minimum of 1,260 MHz and maximum of 4,512 MHz (1.3-4.5 GHz). When running macOS, cluster frequencies are set by macOS at a kernel level; other operating systems may offer more direct control.

P cores idle at 1,260 MHz, but can also be shut down altogether. Previous M-series chips have been reported by the powermetrics command tool as sometimes being idle at a frequency of 0 MHz, but the M4 is the first to have idle and down states reported separately, for example:
CPU 4 active residency: 0.00%
CPU 4 idle residency: 0.00%
CPU 4 down residency: 100.00%

when that core and its whole cluster are shut down rather than just idling. It’s not clear whether this is merely an administrative change, or M4 cores implement this state differently from previous cores.
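For anyone wanting to gather similar figures, powermetrics is bundled with macOS and must be run as root. An invocation along the following lines samples CPU residency, frequency and power every 0.1 second for 50 samples; the exact options are given here as an assumption to be checked against the man page, not quoted from Apple’s documentation.

sudo powermetrics --samplers cpu_power -i 100 -n 50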

Instruction set

There’s confusion over the Instruction Set Architecture (ISA) supported by M4 cores. This is explained in the LLVM source, where it’s claimed that they’re “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE). Some might consider that’s closer to ARMv8.7-A, one version more recent than the M3’s ARMv8.6-A.

Although this is now fully supported in LLVM clang, it’s not clear how fully it’s supported by Xcode, for example.

Power

When shut down, a P core consumes no power, of course, and at idle with no active residency, it uses only 1-2 mW, according to measurements reported by powermetrics.

Maximum power consumption rises to approximately 1,400 mW when running intensive floating point calculations at 100% active residency, and to approximately 3,230 mW when running NEON vector instructions at 100% active residency.

macOS core allocation

Threads are normally allocated by macOS to an available P core when their designated Quality of Service (QoS) is higher than 9 (Background), for example when using Dispatch, formerly branded Grand Central Dispatch (GCD). Running threads may also be moved periodically between P cores in the same cluster, and between clusters. Previous M-series chips appear to move threads less frequently, and may leave them to run to completion after several seconds on the same core, but threads appear to be considerably more mobile when running on M4 P cores.

VM4core4threadPcoresARes

This bar chart shows 4 threads from 4 virtual CPUs in a VM running for 3 seconds at 100% active residency. For almost all that period, the threads remain running on the 4 physical cores of the first P cluster in this M1 Max, with the second P cluster remaining idle for much of that time.

The following charts show 4 threads of intensive in-core floating point arithmetic running on the P cores of an M4 Pro.

m4threads1clusters

When viewed by cluster, those threads are loaded first onto the second P cluster (red bars), where they run for 0.4 seconds before being moved to the first cluster (pale blue bars). After running there for 1.3 seconds, they’re moved back to the second cluster for a further 1.3 seconds, before completing on the first cluster.

The next two bar charts show each cluster separately, illustrating thread mobility within them.

m4threads2cluster1

When running on the first cluster (above), threads appear to be moved to a different core approximately every 0.3 second, as they do when on the second cluster (below).

m4threads3cluster2

m4threads4frequency

Cluster frequency matches this movement, with each cluster being run up to maximum frequency (shown here averaged across the whole cluster) to process the threads running on its cores. The black line below those for the P clusters shows the small changes in average frequency for the E cluster over this period.

m4threads5power

This last chart shows the total CPU power use in mW over the same period. Of particular interest here is the consistent difference in power use reported by powermetrics between the two P clusters: the first (P0) used a steady 6,000 mW when running these four threads, whereas the second (P1) used slightly less, at 5,700-5,800 mW. That could be the result of measurement error in powermetrics, peculiar to this particular chip, or could reflect an underlying difference between the two clusters.

Thread mobility makes interpreting CPU History in Activity Monitor difficult, as the fastest frequency of sampling available there is every second, while powermetrics was sampling every 0.1 second when gathering the data above. As groups of threads may be moved between clusters every 1.3 seconds or so, this can give the impression that threads are being run on both clusters simultaneously. Once again, great care is needed when interpreting the data shown by Activity Monitor.

Key information

  • Current M4 chips offer 4-12 CPU P cores.
  • M4 P cores are arranged in clusters of up to 6, sharing L2 cache and running at a common frequency.
  • P core clusters can be shut down, left idling at their minimum frequency of 1,260 MHz, or run at one of 18 set frequencies up to a maximum of 4,512 MHz, as controlled by macOS.
  • Their instruction set is “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE).
  • They use 1-2 mW when idle, rising to peaks of 1,400 mW (floating point) or 3,230 mW (NEON vector code).
  • macOS preferentially allocates them threads at all QoS higher than 9 (Background).
  • Threads running on M4 P cores are mobile, and may be moved to another core in the same cluster frequently, and after just over a second may be transferred to a core in the other P cluster, when available.
  • Thread mobility makes interpretation of the CPU History window in Activity Monitor very difficult.

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Last Week on My Mac: Mac mini M4 Pro first impressions, cores and more

By: hoakley
10 November 2024 at 16:00

My Mac mini M4 Pro arrived four days early, and swiftly exceeded all expectations. Everyone who has seen this new design remarks on how tiny it is, and I still keep looking at it, wondering how the smallest Mac I have ever owned is also by far the fastest. For those who collect numbers, it matched the growing collection in Geekbench’s database, with a single-core score of 3,892, a multi-core of 22,706, and Metal GPU of 110,960. Of course those new MacBook Pros with M4 Max chips are doing better with their extra two CPU cores and twice the number of GPU cores, but second best is still far superior to anything I’ve run before.

Not only is this my fastest Mac ever, but it was also the quickest and simplest to commission. It replaces my Mac Studio M1 Max, and now sits under my Studio Display using the Studio’s old keyboard and trackpad. As I’ll explain in a future article, I opted to migrate to the mini during initial setup using the Studio’s backup SSD, a sleek OWC Envoy Pro SX connected by Thunderbolt 3. Including the inevitable macOS update, the whole process took less than two hours, from the start of unboxing to the first full Time Machine backup.

With 48 GB memory and an internal SSD of 2 TB, I was keen to benchmark that once its initial backup was done. Although I have reduced confidence in these figures, there’s no doubt that this SSD is significantly quicker than the 2 TB in the Studio; how much quicker is open to debate. AmorphousDiskMark reported a write speed of 7.7 GB/s, while my own Stibium gave 10.3 GB/s across a broad range of file sizes. There was closer agreement on read speed, at around 6.8 GB/s. Both of those are comfortably above what I expect when I can finally get hold of an external Thunderbolt 5 SSD.

CPU cores

My M4 Pro is the full-performance variant, with 10 P and 4 E cores, set in three clusters: an E cluster with all four E cores, and two clusters of five P cores. Since its M3 chips, Apple has increased the maximum number of cores in a cluster from 4 (in M1 and M2 chips) to 6 (M3 and M4).

M4 P cores run at frequencies between 1260 and 4512 MHz (1.3-4.5 GHz), their maximum frequency being 111% that of the M3 P core, and 140% that of the M1 P core. M4 E cores have a narrower range of frequencies than those in the M3, between 1020-2592 MHz (1.0-2.6 GHz), so running at 137% of M3 frequency when running low QoS threads, and 94% of M3 when running at maximum frequency for high QoS threads spilt over from P cores. Those are going to make comparisons between E core performance very interesting indeed.

Single-thread and single-core comparisons of P core throughput are inevitably in the M4’s favour, with significantly better performance across integer, floating point and vector performance, as shown in the chart below.

M4M3multiTests

The Y axis here gives loop throughput per second for my four basic in-core performance tests, a tight assembly code integer math loop, another tight assembly code loop of floating point math, NEON vector processor assembly code, and a tight loop calling an Accelerate routine run in the NEON unit. Pale blue bars are results for the M1, purple for the M3, and red for the shiny new M4.

Although improvements in integer and floating point performance might appear small here, these results are for single threads. When scaled up across more P cores, the differences are magnified, as shown in the regressions below.

M4vM3floatTimevThreads

This plots total times taken to execute multiple threads, each consisting of 10^9 (one thousand million, or one billion) loops of floating point assembly code, against the number of threads, the same as the number of cores used, as each core runs one thread. According to the fitted linear regression equations, each thread/core of 10^9 loops takes the same period: 6.04 seconds for the M3 P cores, and 4.69 for the M4. Those equate to 1.66 x 10^8 loops/second per thread for the M3, and 2.13 x 10^8 for the M4. On that basis, the M4 performs at 128% that of the M3, significantly better than expected by the 111% increase in maximum core frequency. That requires the M4 to have improved processing speed for the same frequency of the M3, a frequency-independent improvement.

Core allocation

One striking difference between M4 CPU cores and those of previous Apple silicon chips is their pattern of core allocation, something you may notice even in Activity Monitor, for all its faults. This is best illustrated in its CPU History window.

For these two series of tests, I waited until the Mac was idling quietly, with little happening on its CPU cores. I then ran, in rapid succession, a series of my in-core floating point tests of 10^9 loops at high QoS, starting with a single thread, then two, and so on up to six or eight threads. This results in a pattern distinct from anything you’ll see in Intel cores, as I have shown here since the early days of the M1.

m3profloptcompo

This is the series seen on my MacBook Pro M3 Pro running macOS 15.2 beta, and is typical of M1, M2 and M3 chips. Although the cores are out of order here, these are the six cores in the chip’s single P cluster. At the left is the obvious peak from 1 thread running on Core 9, then the 2 threads of the next test appear on cores 9 and 12. Three fully occupy 7, 9 and 12, with a little spilt onto 8, and so on until all six are fully occupied with 6 threads.

Activity Monitor is a little too crude to see how distinct this is, for which you have to resort to shorter sampling periods of 0.1 second, and the finer detail of active residency reported by powermetrics. Those confirm that, much of the time, these heavy in-core compute loads run in a single thread on a single core, and if they are moved around, it’s relatively infrequently. E cores are different, though.

Compare that with a similar sequence, this time for 1-8 threads, on the M4 Pro. Its ten P cores are divided into two clusters, which I’ve separated here with the blue line, although within each cluster the cores aren’t in numeric order.

m4profloptcompo

Reading again from the left, a single thread is run in cores 6 and 8, with a little in 14 in the second cluster. Two threads are run in all five cores of the second cluster, with a little at the end from those of the first cluster. Three threads are similar, with significant contributions from all ten cores, and from then on they are similar.

So, on the M4’s P cores, threads are no longer allocated to single cores, and it appears from the CPU History window that they may even be allocated across more than one cluster. In fact, when the detailed active residency and frequency data from powermetrics are used, while threads are often moved between two cores in the same cluster, macOS will normally avoid unnecessarily running threads on other clusters, although it will happily move all threads from one cluster to another.

The rationale behind running all threads within the same cluster is clear, as CPU core frequencies of all cores in the same cluster are the same. Constraining threads to a single cluster thus allows others to idle and use less energy. While that continues in M4 chips, it isn’t clear why threads are moved between cores so frequently, or moved together to another cluster, and whether that brings improvements in performance.

Exclusive first test: can the new Mac mini be powered through its Type-C ports? Our testing uncovered Apple's secret

By: 肖凡博
7 November 2024 at 22:00

The new Mac mini goes on sale tomorrow. The smallest Apple desktop ever made, it is modest in size but not in significance.

While its volume has shrunk by more than 40%, its CPU and GPU have both moved up a notch. More importantly, with the education discount you can buy an Apple computer with an M4 chip and 16 GB of memory as standard for under 4,000 yuan.

We got hands-on with Apple's new 'pocket powerhouse' at the first opportunity.

Beyond the usual Mac mini review, this time we were also interested in its USB-C ports. For one thing, the Mac mini puts two frequently used Type-C ports on the front, making it far more convenient to plug and unplug cables; for another, a change on Apple's website caught our attention.

Apple had previously listed the new iMac and Mac mini in the compatibility lists for its 70W, 96W and 140W USB-C power adapters on its website, but quickly removed them, indicating that the new iMac and Mac mini do not currently support being powered through a USB-C port.

Still, we were curious: does the new Mac mini really support receiving power through a Type-C port? If so, what is limiting that capability? Was the listing on Apple's website an editorial mistake, or an accidental leak of a 'secret'?

To get to the bottom of this website mix-up, ifanr obtained a Mac mini M4 Pro at the first opportunity and ran detailed tests with professional instruments and software at 制糖工厂's 明日实验室 (Tomorrow Lab).

▲ Testing in progress at 制糖工厂's 明日实验室 lab

The conclusions first:

  • None of the three Thunderbolt 4/5 ports on the back of the Mac mini M4 can act as a power input; in other words, the Mac mini cannot be powered through its Thunderbolt ports.
  • The two USB-C ports on the front (USB 3, up to 10 Gb/s) cannot currently act as power inputs either, but Apple has in fact done at least part of the hardware work needed for them to receive power: these ports are in effect so-called dual role ports, able both to supply power outwards and to accept power inwards.

A few things about Type-C ports

Before starting the experiment, a little background is needed, beginning with the protagonist of this story. Although they all look the same, not every Type-C port can be used for power delivery.

The new Mac mini has five Type-C ports in total.

The Type-C ports on the back use Thunderbolt 5 / USB4, a high-speed interface developed jointly by Intel and Apple that combines data, video, audio and power transfer over a single connector.

Compared with ordinary USB-C, Thunderbolt has clear advantages in several respects:

  • higher data transfer rates
  • stronger video output capability
  • support for PCIe data transfer
  • greater power delivery capability
  • better compatibility and expandability

However, we want to stress that not every Thunderbolt port implements all of the above; on some devices Thunderbolt ports can only be used for data transfer or display output, and the newly released Mac mini M4 is one such case.

The other two Type-C ports, on the front, support USB 3, with transfer rates of up to 10 Gb/s as their main feature.

In our tests, the new Mac mini's Type-C ports likewise do not support powering the machine, so for now the only way to power the Mac mini on is through its traditional power inlet.

And that is exactly where things get interesting.

The protocol is there, but no agreement is reached

As Type-C ports have spread across consumer electronics, USB-IF (of which ifanr is a member) has published the USB PD specification. Apple has used the USB Power Delivery protocol for receiving power since the iPhone 8.

▲ Image from: Google

The USB PD protocol allows devices to transfer power efficiently in either direction, and supports multiple power levels and voltage configurations to meet the needs of different devices.

Put simply, the protocol lets the power source and the power sink agree on a common goal: I want faster charging at a higher voltage, and you have to agree to it.

For a source and a sink to negotiate fast charging, several things are essential:

  • The device must have a USB Type-C port, since the standard USB PD protocol is implemented over USB Type-C.
  • The charger, and the device's power-management chip and controller, must support the USB PD protocol.
  • The sink device (a laptop, smartphone, and so on) must be able to receive and process USB PD signalling.

In the lab we connected a 制糖工厂 IonBridge programmable PD power supply to each of the new Mac mini's front and rear USB-C ports in turn. The machine would not run from any of them, and pressing the power button produced no response.

That was the expected result. However, when the power supply was connected to the two front USB-C ports, we noticed something interesting.

▲ Testing in progress at 制糖工厂's 明日实验室 lab

During PD negotiation, source and sink exchange data; think of it as the two parties checking each other's credentials, making sure the passwords match, the details are correct, and each has found the right partner.

That exchange can be captured with a Cypress CY4500 EZ-PD protocol analyser, allowing the PD negotiation to be analysed.

In principle, if the two sides cannot reach agreement, no normal communication or data transfer takes place, and the EZ-PD protocol analyser would capture nothing.

▲ The CY4500 EZ-PD protocol analyser successfully capturing packets from the negotiation

In our tests, however, the EZ-PD analyser did capture the Source Capabilities broadcast from the 制糖工厂 IonBridge programmable PD supply, offering 5 V.

According to section 8.3.3.2, the Policy Engine Source Port State Diagram, of the USB PD 3.2 specification, once a sink is attached the source's Policy Engine should enter the PE_SRC_Send_Capabilities state while supplying vSafe5V on Vbus.

That demonstrates that the hardware design of the Mac mini's front ports meets the USB PD requirements for a sink. However, the Mac mini made no response at all to the Source Capabilities messages. IonBridge's built-in debug interface confirmed that after each Source Capabilities broadcast went unanswered (SourceCapabilityTimer timeout), the broadcast was simply repeated.

For this rather technical process we drew a simple flow chart marking the stages of the charging sequence; in short, once the whole flow completes, charging proceeds normally.

So the USB-C ports on the front of the Mac mini M4 Pro could well gain PD support later. That suggests Apple's removal of the new Mac mini from its USB-C power adapter compatibility list may not have been an editorial slip at all, but an accidental disclosure of future plans.

Verification and conclusions

Using the 制糖工厂 IonBridge programmable PD supply and the Cypress CY4500 EZ-PD analyser, we successfully captured PD messages between a power supply containing a PD fast-charging controller and the Mac mini M4 Pro.

The data show, though, that only the IonBridge supply was broadcasting; the sink never replied, making it rather a one-sided conversation.

For now, then, a power supply still cannot power the Mac mini M4 Pro through its USB-C ports.

We weren't done yet: an experiment without a control is neither rigorous nor complete.

We then used the same equipment and software to test the Thunderbolt ports on the back of the new Mac mini, with the following results:

After connecting the 制糖工厂 IonBridge programmable PD supply, the CY4500 EZ-PD captured no communication packets at all. The IonBridge's internal Type-C PHY state also showed no Rd pull-down resistance detected on the CC pin.

▲ Image from: Google

That neatly proves two things:

  • It directly proves that the three Thunderbolt ports on the back of the Mac mini M4 Pro have no physical support for PD, and cannot be used to power the Mac mini M4 Pro.
  • It indirectly proves the latent power-sink capability of the USB-C ports on the front.

That more or less concludes the experiment and its conclusions, but a few caveats are worth stressing:

We have not yet opened the machine, so we cannot tell whether it contains the power-management silicon needed to negotiate PD.

Nor do we know whether the front ports have the buck/boost circuitry required to take the right voltage from the front USB-C ports and feed it into the machine's power rail.

With the machine unpowered, the CC pin of the rear Type-C ports measures 580 ohms to ground, which does not meet the 5.1 kΩ Rd required by PD.

One possible explanation is that during development this machine once supported PD but the support was cut for some reason; or that, because of schedule pressure, the hardware was finished while the software was not.

More of this user-friendly strategy, please

In fact, whether the Mac mini can be powered through Type-C is hardly a new topic: a few years ago a blogger heavily modified an older Mac mini, shrinking it by a third and powering it directly from a power bank.

▲ On the left is the modified Mac mini Go, which can be powered from a power bank. Image from: Bilibili creator @Chrisroom

Since Apple's aspiration for the Mac mini is a small form with high performance, people naturally want to know just how small it can go; after all, in the market for desktop machines the size of a handheld console, Apple's performance and experience are in a class of their own.

This experiment was our way of satisfying our own curiosity, and everyone else's, about whether that ideal, truly portable desktop has arrived yet.

The conclusion is now clear: not yet, but it's close.

Apple's website mix-up also inadvertently revealed its plans. The protocol packets captured from the Mac mini may look like an accident, but they are indirect proof that this Mac mini has been built in a very Apple way.

However much Apple's products are questioned and criticised, one thing has to be acknowledged: this is a company that flies the flag of long-termism. Many of its decisions may not be understood at the time, but they all lay groundwork for the future.

In 2017, Apple pioneered the one-piece L-shaped battery in the iPhone X, using the new shape to make efficient use of internal space while keeping the device thin, light and capable.

▲ Image from: iFixit

Once mass-produced and widely deployed in the iPhone X, that battery layout, validated by Apple as workable, continues right up to the newly released iPhone 16 series.

Two months ago, with the launch of the iPhone 16 series, Apple moved from the long-standing 6 GB of RAM to 8 GB as standard, and the new Mac mini has likewise gone from 8 GB to 16 GB, the better to run Apple Intelligence features on Apple devices.

▲ Image from: TechCrunch

Admittedly, no user has yet experienced all of the features shown at the launch event, and users of Chinese-market models have yet to see them at all.

But long-termism pushes Apple to settle the unchangeable things first: software and system issues can be fixed through OTA updates, hardware cannot, so on some Apple products we see specifications that are more forward-looking than before.

Even if we cannot use Apple Intelligence yet, the larger memory already improves everyday use, with background apps killed less often; and even if hardly anyone would power a Mac mini over Type-C today, once the system supports it some day, inventive users will find new uses for the mini.

Thinking a little further ahead and doing a little more increases user loyalty for Apple; with enough hardware headroom, upgrade cycles can stretch a little longer. For users, a phone bought years ago can still gain new features, and second-hand prices hold up better.

A product roadmap that benefits both sides: Apple is doing it, and we hope it does a little more of it.



Which M4 chip and model?

By: hoakley
7 November 2024 at 15:30

In the light of recent news, you might now be wondering whether you can afford to wait until next year in the hope that Apple then releases the M4 Mac of your dreams. To help guide you in your decision-making, this article explains what chip options are available in this month’s new M4 models, and how to choose between them.

CPU core types

Intel CPUs in modern Macs have several cores, all of them identical. Whether your Mac is running a background task like indexing for Spotlight, or running code for a time-critical user task, code is run across any of the available cores. In an Apple silicon chip like those in the M4 family, background tasks are normally constrained to efficiency (E) cores, leaving the performance (P) cores for your apps and other pressing user tasks. This brings significant energy economy for background tasks, and keeps your Mac more responsive to your demands.

Some tasks are normally constrained to run only on E cores. These include scheduled background tasks like Spotlight indexing, Time Machine backups, and some encoding of media. Game Mode is perhaps a more surprising E core user, as explained below.

Most user tasks are run preferentially on P cores, when they’re available. When there are more high-priority threads to be run than there are available P cores, then macOS will normally send them to be run on E cores instead. This also applies to threads running a Virtual Machine (VM) using lightweight virtualisation, whose threads will be preferentially scheduled on P cores when they’re available, even when code being run in the VM would normally be allocated to E cores.

macOS also controls the clock speed or frequency of cores. For background tasks running on E cores, their frequency is normally held relatively low, for best energy efficiency. When high-priority threads overspill onto E cores, they’re normally run at higher frequency, which is less energy-efficient but brings their performance closer to that of a P core. macOS goes to great lengths to schedule threads and control core frequencies to strike the best balance between energy efficiency and performance.

Unfortunately, it’s normally hard to see effects of frequency in apps like Activity Monitor. Its CPU % figures only show the percentage of cycles that are used for processing, and make no allowance for core frequency. It will therefore show a background thread running at low frequency but 100%, the same as a thread overspilt from P cores running at the maximum frequency of that E core. So when you see Spotlight indexing apparently taking 200% of CPU % on your Mac’s E cores, that might only be a small fraction of their maximum capacity if they were running at maximum frequency.

There are no differences between chips in the M4 family when it comes to each type of CPU core: each P core in a Base variant is the same as each in an M4 Pro or Max, with the same maximum frequency, and the same applies to E cores. macOS also allocates threads to different types of core using the same rules, and their frequencies are controlled the same as well. What differs between them is the number of each type of core, ranging from 4 P and 4 E in the 8-core variant of the Base M4, up to 12 P and 4 E in the 16-core variant of the M4 Max. Thus, their single-core benchmark results should be almost identical, although their multi-core results should vary according to the number of cores.

Game Mode

This mode is an exception to normal CPU and GPU core use, as it:

  • gives preferential access to the E cores,
  • gives highest priority access to the GPU,
  • uses low-latency Bluetooth modes for input controllers and audio output.

However, my previous testing didn’t demonstrate that apps running in Game Mode were given exclusive access to E cores. But for gamers, it now appears that the more E cores, the better.

GPU cores

These are also used for tasks other than graphics, such as some of the more demanding calculations required for Machine Learning and AI. However, experience so far with Writing Tools in Sequoia 15.1 is that macOS currently offloads their heavy lifting to be run off-device in one of Apple’s dedicated servers. Although having plenty of GPU cores might well be valuable for non-graphics purposes in the future, for now there seems little advantage for many.

Thunderbolt 5

M4 Pro and Max, but not Base variants, come equipped with Thunderbolt ports that not only support Thunderbolt 3 and 4, but 5, as well as USB4. Thunderbolt 5 should effectively double the speed of connected TB5 SSDs, but to see that benefit, you’ll need to buy a TB5 SSD. Not only are they more expensive than TB3/4 models, but at present I know of only one range that’s due to ship this year. There will also be other peripherals with TB5 support, including at least one dock and one hub, although neither is available yet. The only TB5 accessories that are already available are cables, and even they are expensive.

TB5 also brings increased video bandwidth and support for DisplayPort 2.1, although even the M4 Max can’t make full use of that. If you’re looking to drive a combination of high-res displays, consult Apple’s Tech Specs carefully, as they’re complicated.

Although TB5 will become increasingly important over the next few years, TB3/4 and USB4 are far from dead yet and are supported by all M4 models.

Which M4 chip?

The table below summarises key figures for each of the variants in the M4 family that have now been released. It’s likely that next year Apple will release an Ultra, consisting of two M4 Max chips joined in tandem, in case you feel the burning desire for 24 P and 8 E cores.

m4configs2

Models available next week featuring each M4 chip are shown with green rectangles at the right.

There are two variants of the Base M4, one with 4P + 4E and 8 GPU cores, the same as Base variants in M1 to M3 families. There’s also the more capable variant, for the first time with 4P + 6E, which promises to be a better all-rounder, particularly when in Game Mode. It also has an extra couple of GPU cores.

The M4 Pro also comes in two variants, this time differing in the number of P cores, 8 or 10, and GPU cores, 16 or 20. Those overlap with the M4 Max, with 10 or 12 P cores and 32 or 40 GPU cores. Thus the gap between M4 Pro and Max isn’t as great as in the M3, with the GPUs in the M4 Max being aimed more at those working with high-res video, for instance. For more general use, there’s little difference between the 14-core Pro and Max.

Memory and storage

Chips in the M4 family also determine the maximum memory and internal SSD capacity. Apple has at last eliminated base models with only 8 GB of memory, and all now start with at least 16 GB. Base M4 chips are limited to a maximum of 32 GB, while the M4 Pro can go up to 64 GB, and the 16-core Max up to 128 GB, although in its 14-core variant, the Max is only available with 36 GB (I’m very grateful to Thomas for pointing this out below).

Unfortunately, Apple hasn’t increased the minimum size of internal SSD, which remains at 256 GB for some base models. Smaller SSDs may be cheaper, but they are also likely to have shorter lives, as under heavy use their small number of blocks will be erased for reuse more frequently. That may shorten their life expectancy to much less than the normal period of up to 10 years, as was seen in some of the first M1 models. This is more likely to occur when swap space is regularly used for virtual memory. I for one would have preferred 512 GB as a starting point.

While Base M4 chips come with SSDs up to 2 TB in size, both Pro and Max can be supplied with internal SSDs of up to 8 TB.

I hope this proves useful in guiding your decision.

Bigger MacBook Pro updates still to come? Those who wait may win again

By: 周奕旨
4 November 2024 at 17:47

Last week, at a pace of one machine a day, Apple released the first Macs with M4 chips:

  • iMac (with the M4 chip)
  • Mac mini (with M4 and M4 Pro chips)
  • MacBook Pro (with M4, M4 Pro and M4 Max chips)

As the undisputed star of the current Mac family, the MacBook Pro lost none of its buzz after the leak and still closed the show; and ahead of the much-anticipated MacBook Air, more Mac rumours have arrived early.

In this generation of MacBook Pro, the M4 family of chips gains a stronger NPU, well prepared for the coming Apple Intelligence.

According to Mark Gurman, though, the M4 Max is not the top of the range: the Ultra model, absent from the M3 family, will return.

Like its predecessors the M1 Ultra and M2 Ultra, the M4 Ultra is expected to use UltraFusion, the so-called 'glue' technique, to join two M4 Max dies together.

The M4 Ultra would thus have up to a 32-core CPU, an 80-core GPU and a 32-core NPU, with twice the transistor count of the M4 Max.

For reference, the M2 Ultra scores 21,241 in the Geekbench 6 multi-core test against 14,621 for the M2 Max; the M4 Ultra should behave similarly: despite diminishing returns, doubling the chip still brings a substantial performance uplift.

As the first batch of M4 Macs has now been released, the M4 Ultra is more likely to appear in models not yet announced; given its positioning, it will probably debut, like the M2 Ultra, in next year's higher-end Mac Studio and Mac Pro, providing ample NPU capacity for Apple Intelligence and the performance needed for ray tracing.

Beyond the M4 Ultra, there is also news of later M-series chips.

Next year Apple is expected to introduce the M5, reckoned to be a routine upgrade; in Apple's usual fashion, it may simply bring further refinements to power consumption and process technology.

Although the 2025 M5 may look unremarkable, the 2026 MacBook Pro should not disappoint.

According to the leaks, the 2026 MacBook Pro will bring three changes.

Following the process roadmap, the 2026 MacBook Pro will carry an M6 chip made on TSMC's 2 nm process.

Compared with the 3 nm process used by today's M4, the smaller node means transistors sit closer together, which has two benefits: less current loss and leakage, and more transistors in the same area.

In other words, the M6 should use less power and deliver more performance.

Beyond the chip, the design may also change.

Given Apple's plan of redesigning the MacBook Pro roughly every five years, it is indeed about time for a new look.

In the last redesign, Apple's design lead Evans Hankey abandoned the wedge-shaped body championed by Jony Ive, removed the controversial butterfly keyboard and Touch Bar, and brought back the SDXC card slot and HDMI port, signalling an all-round return of functionality.

Those changes were widely praised and, to a degree, seen as a symbol of Apple putting things right: function no longer gives way to design, and the two reach a comfortable balance.

But Apple clearly has more in mind: the current MacBook Pro feels rather bulky, and Apple wants to make it thinner.

Evans Hankey, who designed this generation of MacBook Pro, left Apple in early 2023; her successor Molly Anderson now heads Apple design, but unlike Jony Ive and Evans Hankey she has not been appointed an Apple vice president.

Her signature design is the iPad Pro that debuted the M4 at this year's WWDC; under her leadership the iPad Pro M4 became thinner while keeping its battery capacity and improving heat dissipation by 20%.

Barring surprises, she will also lead the design of the new MacBook Pro, and whether desktop-class components can be kept cool in a thinner body will be the real test of her skill.

Apple has been playing with displays this year: on the newly released MacBook Pro and iMac, the nano-texture glass that first appeared on the iPad Pro M4 has spread to the Mac family's screens.

In 2026 the MacBook Pro's display is due a wholesale update, probably moving from today's mini-LED technology to OLED.

That upgrade can bring higher contrast, a wider colour gamut, faster response times and a thinner design, improving the visual experience and battery life while reducing blooming and improving viewing angles, making it better suited to creative and professional users. The risk of burn-in comes with it, though, and whether it can stand up to the demands of Pro users remains to be seen.

With the arrival of the M4, Apple Intelligence now has a capable environment and is only waiting to land; and in the foreseeable future a further refined M5, a more powerful M6, a new chassis design and new display technology will arrive in turn. The MacBook Pro is not moving slowly.

For anyone who urgently needs a new machine, though, what matters most is what is available now. ifanr has prepared three detailed reviews of the M4 Macs released so far: which Mac is most worth buying? How do you choose between the three M4 chip tiers? How do they perform in different scenarios? We answer all of these in one go.

Will you choose one of this year's Macs, or hold out and wait for something better?



Last Week on My Mac: M4 incoming

By: hoakley
3 November 2024 at 16:00

Almost exactly a year after it released its first Macs featuring chips in the M3 family, Apple has replaced those with the first M4 models. Benchmarkers and core-counters are now busy trying to understand how these will change our Macs over the coming year or so. Before I reveal which model I have ordered, I’ll try to explain how these change the Mac landscape, concentrating primarily on CPU performance.

CPU cores

CPUs in the first two families, M1 and M2, came in two main designs, a Base variant with 4 Performance and 4 Efficiency cores, and a Pro/Max with 8 P and 2 or 4 E cores, that was doubled-up to make the Ultra something of a beast with its 16 P and 4 or 8 E cores. Last year Apple introduced three designs: the M3 Base has the same 4 P and 4 E CPU core configuration as in the M1 and M2 before it, but its Pro and Max variants are more distinct, with 6 P and 6 E in the Pro, and 10-12 P and 4 E cores in the Max. The M4 family changes this again, improving the Base and bringing the Pro and Max variants closer again.

As these are complicated by sub-variants and binned versions, I have brought the details together in a table.

mcorestable2024

I have set the core frequencies of the M4 in italics, as I have yet to confirm them, and there’s some confusion whether the maximum frequency of the P core is 4.3 or 4.4 GHz.

Each family of CPU cores has successively improved in-core performance, but the greatest changes are the result of increasing maximum core frequencies and core numbers. One crude but practical way to compare them is to total the maximum core frequencies in GHz for all the cores. Strictly speaking, this should take into account differences in processing units between P and E cores, but that also appears to have changed with each family, and is hard to compare. In the table, columns giving Σfn are therefore simply calculated as
(max P core frequency x P core count) + (max E core frequency x E core count)

Plotting those sum core frequencies by variant for each of the four families provides some interesting insights.

mcoresbars2024

Here, each bar represents the sum core frequency of each full-spec variant. Those are grouped by the variant type (Base, Pro, Max, Ultra), and within those in family order (M1 purple, M2 pale blue, M3 dark blue, M4 red). Many trends are obvious, from the relatively low performance expected of the M1 family, except the Ultra, and the changes between families, for example the marked differences in the M4 Pro, and the M3 Max, against their immediate predecessors.

Sum core frequencies fall into three classes: 20-30, 35-45, and greater than 55 GHz. Three of the four chips in the M1 family are in the lowest of those, with only the M1 Ultra reaching the highest. The M4 is the first Base variant to reach the middle class, thanks in part to its additional two E cores. Two of the M4 variants (Pro and Max) have already reached the highest class, and any M4 Ultra would reach far above the top of the chart at 128 GHz.

Real-world performance will inevitably differ, and vary according to benchmark and app used for comparison. Although single-core performance has improved steadily, apps that only run in a single thread and can’t take advantage of multiple cores are likely to show little if any difference between variants in each family.

Game Mode is also of interest for those considering the two versions of the M4 Base, with 4 or 6 E cores. This is because that mode dedicates the E cores, together with the GPU, to the game being played. It’s likely that games that are more CPU-bound will perform significantly better on the six E cores of the 10-Core version of the iMac, which also comes with a 10-core GPU and four Thunderbolt 4 ports.

Memory and GPU

Memory bandwidth is also important, although for most apps we should assume that Apple’s engineers match that with likely demand from CPU, GPU, neural engine, and other parts of the chip. There will always be some threads that are more memory-bound, whose performance will depend more on memory bandwidth than on CPU or GPU cores.

Although Apple claims successive improvements in GPU performance, the number of GPU cores has ranged from 8 in Base chips up to 32-40 in Max chips. Where the Max variants come into their own is in support for multiple high-res displays, and in challenging video editing and processing.

Thunderbolt and USB 3

The other big difference in these Macs is support for the new Thunderbolt 5 standard, available only in models with M4 Pro or M4 Max chips; Base variants still only support Thunderbolt 4. Although there are currently almost no Thunderbolt 5 peripherals available apart from an abundant supply of expensive cables, by the end of this year there should be at least one range of SSDs and one dock shipping.

As ever with claimed Thunderbolt performance, the figures given don't tell the whole story. Although both TB4 and USB4 claim 'up to' 40 Gb/s transfer rates, in practice external SSD performance differs significantly, with Thunderbolt topping out at about 3 GB/s and USB4 reaching up to 3.4 GB/s. TB5 won't deliver the whole of its claimed maximum of 120 Gb/s to a single storage device either, and current reports are that it will only achieve disk transfers at about 6 GB/s, or twice TB4. However, that's close to the expected performance of internal SSDs in Apple silicon Macs, and should make booting from a TB5 external SSD almost indistinguishable in speed from booting from the internal SSD.
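
To put those numbers in context, here's a minimal sketch comparing each link's nominal rate, converted from Gb/s into a theoretical ceiling in GB/s, against the real-world SSD transfer rates quoted above; the shortfall is largely protocol and controller overhead.

```swift
// Nominal link rates and the observed SSD throughput figures quoted in the text.
let links: [(name: String, nominalGbps: Double, observedGBps: Double)] = [
    ("Thunderbolt 4", 40, 3.0),
    ("USB4", 40, 3.4),
    ("Thunderbolt 5", 120, 6.0),
]

for link in links {
    let ceilingGBps = link.nominalGbps / 8   // 8 bits per byte, ignoring overhead
    let efficiency = Int(link.observedGBps / ceilingGBps * 100)
    print("\(link.name): ceiling \(ceilingGBps) GB/s, observed \(link.observedGBps) GB/s (~\(efficiency)%)")
}
```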

As far as external ports go, this widens the gap between the M4 Pro Mac mini’s three TB5 ports, which should now deliver 3.4 GB/s over USB4 or 6 GB/s over TB5, and its two USB-C ports that are still restricted to USB 3.2 Gen 2 at 10 Gb/s, equating to 1 GB/s, the same as in M1 models from four years ago.

My choice

With a couple of T2 Macs and a MacBook Pro M3 Pro, I’ve been looking to replace my original Mac Studio M1 Max. As it looks likely that an M4 version of the Studio won’t be announced until well into next year, I’m taking the opportunity to shrink its already modest size to that of a new Mac mini. What better choice than an M4 Pro with 10 P and 4 E cores and a 20-core GPU, and the optional 10 Gb Ethernet? I seldom use the fourth Thunderbolt port on the Studio, and have already ordered a Kensington dock to deliver three TB5 ports from one on the Mac, and I’m sure it will drive my Studio Display every bit as well as the Studio has done.

If you have also been tempted by one of the new Mac minis, I was astonished to discover that three-year AppleCare+ for it costs less than £100, which is two-thirds of what I pay each year for AppleCare+ on my MacBook Pro.

I look forward to diving deep into both my new Mac and Thunderbolt 5 in the coming weeks.

Apple releases the M4 MacBook Pro! 16 GB of memory across the whole range, and the longest battery life of any Mac yet

By: 周奕旨
31 October 2024 at 00:15

In early October, the new MacBook Pro was unboxed and handled by several bloggers before it had even been announced, arguably Apple's most serious product leak since the iPhone 4.

You might have expected Apple to bring its launch event forward to catch that flood of attention. Instead, as if in a huff, Apple skipped the event altogether and released one product a day; after the iMac and Mac mini, the new MacBook Pro has now arrived with the full M4 family of chips.

The recipe is familiar: this generation of MacBook Pro changes little on the outside, so let's turn straight to today's highlight, the M4 series of chips.

Unlike the iMac, the M4 chips used in the MacBook Pro are all full-spec, with a 10-core CPU and 10-core GPU, in line with the Mac mini announced yesterday.

Apple says a MacBook Pro with the standard M4 is 1.8 times as fast as the M1 13-inch MacBook Pro at tasks such as photo editing, and up to 3.4 times as fast when rendering in Blender.

The other change in the standard M4 MacBook Pro is that it too now starts with 16 GB of memory, configurable up to 32 GB.

With demand for AI growing by the day, computers' appetite for memory keeps climbing, and Apple, now betting heavily on AI, has finally reached the point where it must break with its 'ancestral rule' on base memory.

Previously, a MacBook Pro with the M3 chip supported one external display at up to 6K (60 Hz) and a second at up to 5K (60 Hz), but only with the lid closed.

Now the M4 MacBook Pro can drive two external displays even with its lid open.

The M4 Pro that appeared in yesterday's Mac mini returns, with up to a 14-core CPU, 20-core GPU and 16-core Neural Engine. Compared with the 16-inch MacBook Pro with the M1 Pro, Apple claims up to 3 times the performance when rendering scenes in Maxon Redshift.

At the same time, the M4 Pro's base memory pulls ahead of the standard M4, rising to 24 GB and configurable up to 48 GB.

For AI, Apple has taken another route as well: memory bandwidth in the M4 Pro MacBook Pro is 75% higher than in the previous generation, which Apple claims is twice that of the chip in any AI PC.

Beyond the standard M4 and M4 Pro announced over the previous two days, the largest of the three, the M4 Max, finally appeared today.

As usual, the M4 Max, like the M4 Pro, comes in two configurations: the lower with a 14-core CPU, 32-core GPU and 16-core Neural Engine; the higher with a 16-core CPU, 40-core GPU and 16-core Neural Engine.

MacBook Pro models with the M4 Max start at 36 GB of memory, configurable up to 128 GB.

Performance of the M4 Max also advances steadily: Apple claims up to 3.5 times the performance of the M1 Max when rendering scenes in Maxon Redshift, easily handling heavy creative workloads such as visual effects, 3D animation and film scoring, and up to 2.2 times the build performance when compiling code in Xcode.

In AI, the M4 Max's improvement is more pronounced, with a Neural Engine more than 3 times faster than the M1 Max's; together with the larger memory, developers can comfortably interact with LLMs of nearly 200 billion parameters.
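
To give a sense of why memory capacity matters for models of that size, here's a rough back-of-the-envelope sketch, my own illustration rather than anything from Apple, of the memory needed just to hold the weights of a model with about 200 billion parameters at different quantisation levels.

```swift
import Foundation

// Approximate memory needed to store a model's weights alone,
// ignoring the KV cache, activations, and the rest of the system.
func weightFootprintGB(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8 / 1e9   // bits → bytes → GB (10^9 bytes)
}

let parameters = 200e9   // "nearly 200 billion parameters"
for bits in [16.0, 8.0, 4.0] {
    let gb = weightFootprintGB(parameters: parameters, bitsPerWeight: bits)
    print(String(format: "%2.0f-bit weights: ~%.0f GB", bits, gb))
}
// 16-bit ≈ 400 GB, 8-bit ≈ 200 GB, 4-bit ≈ 100 GB:
// only a heavily quantised model of that scale fits within 128 GB of unified memory.
```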

Expansion is also stronger on the new MacBook Pro: with the M4 Max, it can drive up to 4 external displays with the lid open.

Beyond raw performance, the M4 MacBook Pro's ports have been upgraded too.

The previous entry-level M3 MacBook Pro had 2 Thunderbolt 4 ports; the M4 MacBook Pro now starts with 3 Thunderbolt 4 ports, matching the higher configurations and drawing a sharper line between the Pro and the Air.

M4 Pro and M4 Max models get the Thunderbolt 5 ports that debuted on yesterday's Mac mini, delivering data transfer at up to 120 Gb/s, more than twice the throughput of Thunderbolt 4.

The new MacBook Pro also introduces a nano-texture display option, capable of showing SDR content at up to 1,000 nits in bright surroundings and HDR content at a peak of 1,600 nits. The camera beside the screen has been upgraded as well: a 12 MP camera that automatically keeps the user centred and supports Desk View.

Finally, Apple claims battery life of up to 24 hours, with fast charging that reaches 50% in as little as 30 minutes, further easing any worries about endurance.

Apple has long favoured lustrous finishes such as Space Grey and Silver, but the Space Black introduced last year broke that pattern, a deeper black with a strong yet restrained metallic feel.

Space Black used to be reserved for the M3 Pro and M3 Max; now even the cheapest M4 MacBook Pro gets that premium finish.

The M4 MacBook Pro still comes in 14-inch and 16-inch sizes: the 14-inch can be configured with the M4, M4 Pro or M4 Max, while the 16-inch offers the M4 Pro or M4 Max, the same arrangement as the previous generation.

Finally, the prices:

14-inch M4 MacBook Pro:

  • with M4, from ¥12,999
  • with M4 Pro, from ¥16,999
  • with M4 Max, from ¥26,999

16-inch M4 MacBook Pro:

  • with M4 Pro, from ¥19,999
  • with M4 Max, from ¥27,999

MacBook Pro models with M4 series chips open for pre-order at 9 am on 1 November, and go on sale on 8 November.

At the end of 2020, when Apple's own M1 chip first appeared, the talk was still mostly of big data, cloud computing and the metaverse.

Four years on, AI has become the hottest topic of all.

Topics and products are always closely linked: as Apple's most professional portable machine, the MacBook Pro's across-the-board increase in base memory shows it is clearly prepared for the AI era.

That said, with AI features still absent for now from the mainland China models, the greater significance of the M4 MacBook Pro may be that it has finally reached a comfortable position: clearer in its positioning, more sensible in its line-up, and more complete in its features than its predecessors.

Would you choose an M4 MacBook Pro? Which version do you think offers the best value?

Share your thoughts in the comments.
