Reading view

There are new articles available, click to refresh the page.

Tune for Performance: Core types

17 December 2024 at 15:30

When running apps on Intel Macs, because all their CPU cores are identical, the best you can hope for is that tasks make use of the number of cores available by running in multiple threads. Even single-threaded processes running on Apple silicon Macs get a choice of CPU core type, and this article explains the limited control you have over that.

m4coremanagement1

In my current model of CPU core allocation by macOS, shown in part above, Quality of Service (QoS) is the factor determining which core type a thread will be allocated to. QoS is normally hard-coded into an app, and seldom exposed to the user’s control.

If a thread is assigned a minimal QoS of Background (value 9), or less, then macOS will allocate it to be run on E cores, and it won’t be run on a P core. With a higher QoS, macOS allocates it to a P core if one is available; if not, then it will be run on E cores, with that cluster running at high frequency. Thus, threads with a higher QoS can be run on either P or E cores, depending on their availability.

`taskpolicy`

There are times when you might wish to accelerate completion of threads normally run exclusively on E cores. For example, knowing that a particular backup might be large, you might elect to leave the Mac to get on with that as quickly as possible. There are two methods that appear intended to change the QoS of processes: the command tool taskpolicy, and its equivalent code function setpriority().

Experience with using those demonstrates that, while they can be used to demote threads to E cores, they can’t promote processes or threads already confined to E cores so that they can use both types. For instance, the command
taskpolicy -B -p 567
that should promote the process with PID 567 to run on both types of core, has no effect on processes or their threads that are run at low QoS. taskpolicy can be used to demote processes and their threads with higher QoS to use only E cores, though. Running the command
taskpolicy -b -p 567
does confine all threads to the E cluster, and can be reversed using the -B option for threads with higher QoS (but not those set to low QoS by the process).

qoscores1

That can be seen in this CPU History window from Activity Monitor. An app has run four threads, two at low QoS and two at high QoS. In the left side of each core trace they are run on their respective cores, as set by their QoS. The app’s process was then changed using taskpolicy -b and the threads run again, as seen in the right. The two threads with high QoS are then run together with the two with low QoS in the four E cores alone.

The best way to take advantage of this ability to demote high QoS threads to run them on E cores is in St. Clair Software’s excellent utility App Tamer.

Virtualisation

macOS virtual machines running on Apple silicon chips are automatically assigned a high QoS, and run preferentially on P cores. Thus, even when running threads at low QoS, those are run within threads on the host’s P cores. This remains the only known method of electively running low QoS threads on P cores.

Game Mode, CPU cores

In the Info.plist of an application, the developer should assign that app to one of the standard LSApplicationCategoryTypes. If that’s one of the named game types, macOS automatically gives that app access to Game Mode. Among its benefits are “highest priority access” to the CPU cores, in particular the E cores, whose use by background threads is reduced. Other benefits include highest priority access to the GPU, and doubled Bluetooth sampling rate to reduce latency for input controllers and audio output. Game Mode is automatically turned on when the app is put into full-screen mode, and turned off when full-screen mode is stopped, although the user also has manual control through the mode’s item in the menu bar.

In practice, while this is beneficial to many games, it has little if any use for modifying core type allocation in other circumstances. As the user can’t modify an app’s Info.plist without breaking its signature and notarization, this is only of use to developers.

Summary

The QoS assigned by an app to its threads is used by macOS to determine which core type to allocate them to.
Threads with a low (Background, 9) QoS are run exclusively on E cores. Those with higher QoS are run preferentially on P cores, and normally only on E cores when no P core is available.
App Tamer and taskpolicy can be used to demote higher QoS threads to be run on E cores, but low QoS threads can’t be promoted to be run on P cores.
macOS VMs run on P cores, so low QoS threads running in a VM will be run on P cores, and can be used to accelerate Background threads.
Game Mode gives games priority access to CPU cores, particularly to E cores, but users can’t elect to run other apps in Game Mode, and there don’t appear to be benefits in doing so.
If you feel an app would benefit from user control over CPU core type allocation through access to their QoS, suggest it to the app’s developer.

Explainer

Quality of Service (QoS) is a property of each thread in a process, and normally chosen from the macOS standard list:

QoS 9 (binary 001001), named background and intended for threads performing maintenance, which don’t need to be run with any higher priority.
QoS 17 (binary 010001), utility, for tasks the user doesn’t track actively.
QoS 25 (binary 011001), userInitiated, for tasks that the user needs to complete to be able to use the app.
QoS 33 (binary 100001), userInteractive, for user-interactive tasks, such as handling events and the app’s interface.

There’s also a ‘default’ value of QoS between 17 and 25, an unspecified value, and in some circumstances you might come across others.

A brief history of Activity Monitor

The Eclectic Light Company

hoakley

14 December 2024 at 16:00

In the days of Classic Macs, most of our concerns centred on memory rather than CPU, disk or network performance. Tools for managing memory flourished, and I don’t recall any utility provided in Mac OS that looked more broadly. That changed when Mac OS X arrived, and we started taking an interest in Process Viewer and CPU Monitor.

cpumonitor2001

CPU Monitor, seen here in 2001 with its two floating chart views, was a first step in development. I suspect this was taken on my QuickSilver Power Mac G4 with its dual processors, hence the pair of CPU charts.

processviewer2001

Process Viewer is a more obvious ancestor of the modern Activity Monitor, with its list of processes in a column view. Note the surprisingly few system processes running at the time: only 34, compared with today’s list of several hundred.

activitymonitor2005

Activity Monitor integrated those features into its new app in Mac OS X 10.3 Panther in 2003, and is shown here in 10.4 Tiger from a couple of years later, where it’s surprisingly similar to the current app. There are still only 72 processes running, though, and most have less than 5 threads.

activitymonitor2011

Here’s Activity Monitor with its CPU History window on an 8-core Mac Pro, running Mac OS X 10.7 Lion in 2011. The number of processes is still only growing slowly, and had reached 89.

instruments2012a

Xcode’s Instruments added an Activity Monitor template for developers wanting to monitor further details as they tested and debugged their code. At this stage it was limited to presenting the same fundamental measurements as were available in Activity Monitor, but these have since flourished into much greater depth.

instruments2012b

Two years later, in 2013 for OS X 10.9 Mavericks, Activity Monitor underwent revision to add a new tab for Energy. Unlike other panes, the ‘Energy Impact’ is given in arbitrary units rather than calculating Joules from power use over time. It was at about this time that the app also added a Cache pane analysing the performance of the Content Caching service, when enabled on the Mac.

activitymonitor2020i

Until 2020, Activity Monitor had dealt with CPUs with uniform cores. Above are the eight physical and eight Hyper-threaded cores in an 8-core Intel Xeon W in an iMac Pro from 2017, running a heavy load of over 700% CPU. With the first Apple silicon Macs, it had to display CPU use for two different types of core. Note how, by 2020, the total number of processes has shot up to 458.

activitymonitor2021m1

This is an example from one of the first base M1 Macs, with 4 Efficiency and 4 Performance cores displayed neatly in its CPU History window. Although Activity Monitor doesn’t take core frequency into account when measuring % CPU, and can’t display cluster frequencies, it remains one of the essential tools for everyone, whichever age and architecture their Mac.

Tune for Performance: Activity Monitor’s CPU view

The Eclectic Light Company

hoakley

11 December 2024 at 15:30

Activity Monitor is one of the unsung heroes of macOS. If you understand how to interpret its rich streams of measurements, it’s thoroughly reliable and has few vices. When tuning for performance, it should be your starting point, a first resort before you dig deeper with more specialist tools such as Xcode’s Instruments or powermetrics.

How Activity Monitor works

Like Xcode’s Instruments and powermetrics, the CPU view samples CPU and GPU activity over brief periods of time, displays results for the last sampling period, and updates those every 1-5 seconds. You can change the sampling period used in the Update Frequency section of the View menu, and that should normally be set to Very Often, for a period of 1 second.

This was normally adequate for many purposes with older M-series chips, but thread mobility in M4 chips can be expected to move threads from core to core, and between whole clusters, at a frequency similar to available sampling periods, and that loses detail as to what’s going on in cores and clusters, and can sometimes give the false impression that a single thread is running simultaneously on multiple cores.

However, sampling frequency also determines how much % CPU is taken by Activity Monitor itself. While periods of 0.1 second and less are feasible with powermetrics, in Activity Monitor they would start to affect its results. If you need to see finer details, then you’ll need to use Xcode Instruments or powermetrics instead.

% CPU

Central to the CPU view, and its use in tuning for performance, is what Activity Monitor refers to as % CPU, which it defines as the “percentage of CPU capability that’s being used by processes”. As far as I can tell, this is essentially the same as active residency in powermetrics, and it’s vital to understand its strengths and shortcomings.

Take a CPU core that’s running at 1 GHz. Every second it ‘ticks’ forward one billion times. If an instruction were to take just one clock cycle, then it could execute a billion of those every second. In any given second, that core is likely to spend some time idle and not executing any instructions. If it were to execute half a billion instructions in any given second, and spend the other half of the time idle, then it has an idle residency of 50% and an active residency of 50%, and that would be represented by Activity Monitor as 50% CPU. So a CPU core that’s fully occupied executing instructions, and doesn’t idle at all, has an active residency of 100%.

To arrive at the % CPU figures shown in Activity Monitor, the active residency of all the CPU cores is added together. If your Mac has four P and four E cores and they’re all fully occupied with 100% active residency each, then the total % CPU shown will be 800%.

There are two situations where this can be misleading if you’re not careful. Intel CPUs feature Hyper-threading, where each physical core acquires a second, virtual core that can also run at another 100% active residency. In the CPU History window those are shown as even-numbered cores, and in % CPU they double the total percentage. So an 8-core Intel CPU then has a total of 16 cores, and can reach 1600% when running flat out with Hyper-threading.

coremanintel

cpuendstop

The other situation affects Apple silicon chips, as their CPU cores can be run at a wide range of different frequencies according to control by macOS. However, Activity Monitor makes no allowance for their frequency. When it shows a core or total % CPU, that could be running at a frequency as low as 600 MHz in the M1, or as high as 4,512 MHz in the M4, nine times as fast. Totalling these percentages also makes no allowance for the different processing capacity of P and E cores.

Thus an M4 chip’s CPU cores could show a total of 400% CPU when all four E cores are running at 1,020 MHz with 100% active residency, or when four of its P cores are running at 4,512 MHz with 100% active residency. Yet the P cores would have an effective throughput of as much as six times that of the E cores. Interpreting % CPU isn’t straightforward, as nowhere does Activity Monitor provide core frequency data.

tuneperf1

In this example from an M4 Pro, the left of each trace shows the final few seconds of four test threads running on the E cores, which took 99 seconds to complete at a frequency of around 1,020 MHz, then in the right exactly the same four test threads completed in 23 seconds on P cores running at nearer 4,000 MHz. Note how lightly loaded the P cores appear, although they’re executing the same code at almost four times the speed.

Threads and more

For most work, you should display all the relevant columns in the CPU view, including Threads and GPU.

tuneperf2

Threads are particularly important for processes to be able run on multiple cores simultaneously, as they’re fairly self-contained packages of executable code that macOS can allocate to a core to run. Processes that consist of just a single thread may get shuffled around between different cores, but can’t run on more than one of them at a time.

Another limitation of Activity Monitor is that it can’t tell you which threads are running on each core, or even which type of core they’re running on. When there are no other substantial threads active, you can usually guess which threads are running where by looking in the CPU History window, but when there are many active threads on both E and P cores, you can’t tell which process owns them.

You can increase your chances of being able to identify which cores are running which threads by minimising all other activity, quitting all other open apps, and waiting until routine background activities such as Spotlight have gone quiet. But that in itself can limit what you can assess using Activity Monitor. Once again, Xcode Instruments and powermetrics can make such distinctions, but can easily swamp you with details.

GPU and co-processors

Percentages given for the GPU appear to be based on a similar concept of active residency without making any allowance for frequency. As less is known about control of GPU cores, that too must be treated carefully.

Currently, Activity Monitor provides no information on the neural processing unit (ANE), or the matrix co-processor (AMX) believed to be present in Apple silicon chips, and that available using other tools is also severely limited. For example, powermetrics can provide power use for the ANE, but doesn’t even recognise the existence of the AMX.

Setting up

In almost all cases, you should set the CPU view to display All processes using the View menu. Occasionally, it’s useful to view processes hierarchically, another setting available there.

Extras

Activity Monitor provides some valuable extras in its toolbar. Select a process and the System diagnostics options tool (with an ellipsis … inside a circle) can sample that process every millisecond to provide a call graph and details of binary images. The same menu offers to run a spindump, for investigating processes that are struggling, and to run sysdiagnose and Spotlight diagnostic collections. The Inspect selected process tool ⓘ provides information on memory, open files and ports, and more.

Explainers

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

multiterms1

When an app is run, there’s a single runtime instance created from its single executable code. This is given its own virtual memory and access to system resources that it needs. This is a process, and listed as a separate entity in Activity Monitor.

Each process has a main thread, a single flow of code execution, and may create additional threads, perhaps to run in the background. Threads don’t get their own virtual memory, but share that allocated to the process. They do, though, have their own stack. On Apple silicon Macs they’re easy to tell apart as they can only run on a single core, although they may be moved between cores, sometimes rapidly. In macOS, threads are assigned a Quality of Service (QoS) that determines how and where macOS runs them. Normally the process’s main thread is assigned a high QoS, and background threads that it creates may be given the same or lower QoS, as determined by the developer.

Within each thread are individual tasks, each a quantity of work to be performed. These can be brief sections of code and are more interdependent than threads. They’re often divided into synchronous and asynchronous tasks, depending on whether they need to be run as part of a strict sequence. Because they’re all running within a common thread, they don’t have individual QoS although they can have priorities.

Arm 年度技术大会收官，下一代 AI 计算平台在路上了

爱范儿

莫崇宇

21 November 2024 at 18:54

今天下午，一年一度的 Arm Tech Symposia 年度技术大会在深圳圆满结束。

Arm 在本次大会上深入探讨了 AI 对计算的需求，并分享了如何通过硬件、软件、生态系统三大核心更好地把握 AI 的发展机遇，在场与会者也共同探讨了基于 Arm 的技术创新和 AI 发展趋势。

Arm 终端事业部产品管理副总裁 James McNiven 在深圳场的大会主题演讲中强调，Armv9 作为 Arm 最新的技术架构，推出伊始便是为支撑 AI 计算而设计，并持续迭代更新，通过 SVE、SVE2、SME 等关键技术，Arm 以架构创新和强大的软硬件协同能力不断优化移动端 AI 体验，赋能开发者实现卓越的 AI 性能。

在本次大会中，KleidiAI 软件是值得关注的亮点之一。

它实现了与主流 AI 框架的深度集成，能够为开发者提供丝滑的开发体验；当与 Arm CSS 搭配使用时，KleidiAI 通过整合 Neon、SVE2 和 SME2 等一系列 Arm 加速技术，从而显著提升计算应用的性能表现。

据悉，KleidiAI 是一套专门面向 AI 框架开发者的高性能计算内核。

它能够帮助开发者在各种设备上轻松发挥 Arm CPU 上的最佳性能，并充分利用 Neon、SVE2 和 SME2 等关键 Arm 架构的核心特性。

此外，KleidiAI 还集成了 PyTorch、Tensorflow、MediaPipe 等热门 AI 框架，对 Meta Llama 3、Phi-3 等模型进行了性能优化，并且还采用了可前后兼容的设计。

这样做的好处是，确保 Arm 未来在引入更多技术时依然能适用未来市场的需求。

据介绍，KleidiAI 的集成显著提升了生成式 AI 的工作效率。

数据显示，与参考实现方案（基于 llama.cpp，但不含 Kleidi 软件优化）相比，在新的 Arm Cortex-X925 CPU 上，使用（集成了 KleidiAI 的）llama.cpp 的 Meta Llama 3 和微软 Phi-3 大语言模型 (LLM) 的词元 (Token) 首次响应时间加快了 190%。

KleidiAI 的另一大优势在于易于集成。

据悉，Arm 的工程团队只用不到 24 小时就完成了 Llama 3 的性能优化测试。

此外，KleidiAI 还通过 XNNPACK 与 MediaPipe 集成，为在移动设备上运行的开源 Gemma LLM 提供支持。得益于此，Google Pixel 8 Pro 智能手机上 Gemma 2B 的词元首次响应时间缩短了 25%。

与此同时，Arm 还与 Unity 合作开发端侧 AI 推理引擎——Sentis，可让游戏开发者在所有支持 Unity 游戏引擎的设备上打造全新的 AI 游戏体验。

另外，作为迄今速度最快的 Arm 计算平台，Arm 终端 CSS 在计算和图形性能方面实现了超过 30% 的提升，足以应对各类严苛的 Android 工作负载。

与此同时，Arm 终端 CSS 也提高了 59% 的 AI 推理速度，适用于更广泛的 AI/机器学习 (ML) 和计算视觉工作负载。

Arm 终端 CSS 的核心优势在于其搭载了 Arm 迄今性能最强、效率最高、功能最全面的 CPU 集群，致力于实现性能与能效的最佳平衡。

而凭借新一代 Arm Cortex®-X CPU，AI 优化的 Arm 终端 CSS 带来最高的 IPC 同比提升，性能提高 36%；新的 Arm Immortalis GPU 的图形性能提高 37%。

Arm Immortalis-G925 GPU 是 Arm 性能最强、效率最高的 GPU，在多款手游应用中实现了 37% 的性能提升，并在多个 AI 和 ML 网络上提升了 34% 的性能。

Immortalis-G925 主要面向旗舰智能手机市场。

而包括 Arm Mali-G725 和 Mali-G625 GPU 在内的全新高可扩展性 GPU 系列，则面向从高端手机到智能手表和 XR 可穿戴设备等广泛的消费电子设备市场。

Arm 预计到 2025 年底，全球将有超过 1000 亿台具备 AI 能力的 Arm 设备。

从传感器、智能手机，到工业物联网、汽车和数据中心，就像建造摩天大楼需要坚实的地基，AI 技术的蓬勃发展也离不开强大而高效的计算平台作为支撑。

凭借在芯片架构与技术创新上的不懈努力，Arm 正在为这座「AI 摩天大楼」打造最可靠的基石，也将在这场技术变革中扮演愈发关键的角色。

#欢迎关注爱范儿官方微信公众号：爱范儿（微信号：ifanr），更多精彩内容第一时间为您奉上。

爱范儿 | 原文链接 · 查看评论 · 新浪微博

Last Week on My Mac: Mac mini M4 Pro first impressions, cores and more

The Eclectic Light Company

hoakley

10 November 2024 at 16:00

My Mac mini M4 Pro arrived four days early, and swiftly exceeded all expectations. Everyone who has seen this new design remarks on how tiny it is, and I still keep looking at it, wondering how the smallest Mac I have ever owned is also by far the fastest. For those who collect numbers, it matched the growing collection in Geekbench’s database, with a single-core score of 3,892, a multi-core of 22,706, and Metal GPU of 110,960. Of course those new MacBook Pros with M4 Max chips are doing better with their extra two CPU cores and twice the number of GPU cores, but second best is still far superior to anything I’ve run before.

Not only is this my fastest Mac ever, but it was also the quickest and simplest to commission. It replaces my Mac Studio M1 Max, and now sits under my Studio Display using the Studio’s old keyboard and trackpad. As I’ll explain in a future article, I opted to migrate to the mini during initial setup using the Studio’s backup SSD, a sleek OWC Envoy Pro SX connected by Thunderbolt 3. Including the inevitable macOS update, the whole process took less than two hours, from the start of unboxing to the first full Time Machine backup.

With 48 GB memory and an internal SSD of 2 TB, I was keen to benchmark that once its initial backup was done. Although I have reduced confidence in these figures, there’s no doubt that this SSD is significantly quicker than the 2 TB in the Studio; how much quicker is open to debate. AmorphousDiskMark reported a write speed of 7.7 GB/s, while my own Stibium gave 10.3 GB/s across a broad range of file sizes. There was closer agreement on read speed, at around 6.8 GB/s. Both of those are comfortably above what I expect when I can finally get hold of an external Thunderbolt 5 SSD.

CPU cores

My M4 Pro is the full-performance variant, with 10 P and 4 E cores, set in three clusters: an E cluster with all four E cores, and two clusters of five P cores. Since its M3 chips, Apple has increased the maximum number of cores in a cluster from 4 (in M1 and M2 chips) to 6 (M3 and M4).

M4 P cores run at frequencies between 1260 and 4512 MHz (1.3-4.5 GHz), their maximum frequency being 111% that of the M3 P core, and 140% that of the M1 P core. M4 E cores have a narrower range of frequencies than those in the M3, between 1020-2592 MHz (1.0-2.6 GHz), so running at 137% of M3 frequency when running low QoS threads, and 94% of M3 when running at maximum frequency for high QoS threads spilt over from P cores. Those are going to make comparisons between E core performance very interesting indeed.

Single-thread and single-core comparisons of P core throughput are inevitably in the M4’s favour, with significantly better performance across integer, floating point and vector performance, as shown in the chart below.

M4M3multiTests

The Y axis here gives loop throughput per second for my four basic in-core performance tests, a tight assembly code integer math loop, another tight assembly code loop of floating point math, NEON vector processor assembly code, and a tight loop calling an Accelerate routine run in the NEON unit. Pale blue bars are results for the M1, purple for the M3, and red for the shiny new M4.

Although improvements in integer and floating point performance might appear small here, these results are for single threads. When scaled up across more P cores, the differences are magnified, as shown in the regressions below.

M4vM3floatTimevThreads

This plots total times taken to execute multiple threads, each consisting of 10^9 (one thousand million, or one billion) loops of floating point assembly code, against the number of threads, the same as the number of cores used, as each core runs one thread. According to the fitted linear regression equations, each thread/core of 10^9 loops takes the same period: 6.04 seconds for the M3 P cores, and 4.69 for the M4. Those equate to 1.66 x 10^8 loops/second per thread for the M3, and 2.13 x 10^8 for the M4. On that basis, the M4 performs at 128% that of the M3, significantly better than expected by the 111% increase in maximum core frequency. That requires the M4 to have improved processing speed for the same frequency of the M3, a frequency-independent improvement.

Core allocation

One striking difference between M4 CPU cores and those of previous Apple silicon chips is their pattern of core allocation, something you may notice even in Activity Monitor, for all its faults. This is best illustrated in its CPU History window.

For these two series of tests, I waited until the Mac was idling quietly, with little happening on its CPU cores. I then ran, in rapid succession, a series of my in-core floating point tests of 10^9 loops at high QoS, starting with a single thread, then two, and so on up to six or eight threads. This results in a pattern distinct from anything you’ll see in Intel cores, as I have shown here since the early days of the M1.

m3profloptcompo

This is the series seen on my MacBook Pro M3 Pro running macOS 15.2 beta, and is typical of M1, M2 and M3 chips. Although the cores are out of order here, these are the six cores in the chip’s single P cluster. At the left is the obvious peak from 1 thread running on Core 9, then the 2 threads of the next test appear on cores 9 and 12. Three fully occupy 7, 9 and 12, with a little spilt onto 8, and so on until all six are fully occupied with 6 threads.

Activity Monitor is a little too crude to see how distinct this is, for which you have to resort to shorter sampling periods of 0.1 second, and the finer detail of active residency reported by powermetrics. Those confirm that, much of the time, these heavy in-core compute loads run in a single thread on a single core, and if they are moved around, it’s relatively infrequently. E cores are different, though.

Compare that with a similar sequence, this time for 1-8 threads, on the M4 Pro. Its ten P cores are divided into two clusters, which I’ve separated here with the blue line, although within each cluster the cores aren’t in numeric order.

m4profloptcompo

Reading again from the left, a single thread is run in cores 6 and 8, with a little in 14 in the second cluster. Two threads are run in all five cores of the second cluster, with a little at the end from those of the first cluster. Three threads are similar, with significant contributions from all ten cores, and from then on they are similar.

So, in the M4 P cores threads are no longer allocated to single cores, and it appears from the CPU History window that they may even be allocated across more than one cluster. In fact, when the detailed active residency and frequency data from powermetrics are used, while threads are often moved between two cores in the same cluster, macOS will normally avoid unnecessarily running threads on other clusters, although it will happily move all threads from one cluster to another.

The rationale behind running all threads within the same cluster is clear, as CPU core frequencies of all cores in the same cluster are the same. Constraining threads to a single cluster thus allows others to idle and use less energy. While that continues in M4 chips, it isn’t clear why threads are moved between cores so frequently, or moved together to another cluster, and whether that brings improvements in performance.

Why % CPU in Activity Monitor isn’t what you think

The Eclectic Light Company

hoakley

8 November 2024 at 15:30

One of the most frequently quoted measurements in Activity Monitor is % CPU. When the fans blow, or a Mac becomes sluggish, it’s often the first figure to look at and wonder why that process is pulling over 100%. That’s the first thing you learn: here, per cent doesn’t mean out of 100, but the sum of the real percentage for each core. As this Mac has eight cores, that allows it up to 800%, until you realise that Intel processors can use Hyper-Threading to pretend that each is two cores, making a maximum of 1,600%, not bad for a single processor. Then you come completely unstuck with Apple silicon.

Apple defines % CPU as “the percentage of CPU capability that’s being used”, a phrase that doesn’t appear to have any definition either in Apple’s documentation or elsewhere. So like everyone else, you assume that 100% for a core represents its maximum processing capacity. Only you couldn’t be more incorrect: in truth, the number given as % CPU is nothing like that.

Tests

To examine this more clearly, I have an app that runs tight loops of assembly code to load CPU cores on Apple silicon chips and measure their performance. For this, I got it to run several threads, each consisting of 500 million loops of floating point calculations. By running those at different Quality of Service (QoS) settings, macOS will send them to different core types.

When I run two threads at low QoS so they’re run only on E cores of a MacBook Pro M3 Pro, each takes 10.2 seconds to complete. If I instead run eight threads at high QoS, the first six of those will be run on the P cores, but the remaining two spill over onto two of the E cores, where they complete in just 3.0 seconds each.

Yet, according to % CPU in Activity Monitor, the two threads on the E cores come to 200%, and the two from eight run in the second test also come to 200%. You can see this in the CPU History window.

m3activitymon

Here are the 6 E cores at the top, and 6 P cores below. ① marks the first test on two E cores, and shows the total of 200% spread across all six E cores over the 10.2 seconds it took to complete. ② marks the second test, with all 6 P cores and two E cores hitting 100% each, for a total of 800%, as shown in Activity Monitor’s window.

By now I’m sure you guessing how running the same code on the E cores could vary in speed by a factor of 3.4 (10.2 against 3.0 seconds) although they’re the same % CPU: that measurement doesn’t take into account core frequency.

Frequency

Unlike traditional Intel CPUs, CPU cores in Apple silicon chips can be run at a wide range of frequencies, as set by macOS. To make this frequency control a bit simpler, cores are grouped into clusters, that also share L2 cache, and within each cluster all cores run at the same frequency. The M3 Pro chip consists of two clusters, one of 6 E cores, the other of 6 P cores.

When macOS loads two of those E cores with low QoS threads, it sets them to run at a low frequency to make the most of their energy efficiency. When it loads two E cores with threads that were intended to be run on P cores, as they have a high QoS, it runs the E cluster up to high frequency, so they perform better.

Measuring frequency of CPU cores is straightforward using the powermetrics command tool. When those threads were being run on E cores at low QoS, frequency was held at their minimum of 744 MHz, but run at high QoS their frequency was set to maximum, at 2748 MHz, 3.7 times faster. If full load at maximum frequency is 100% of their capability, then at low QoS they were really running at 27%, not the 100% given by Activity Monitor, and that largely accounts for the difference in time that those threads took to complete their floating point calculations.

What is % CPU?

What Activity Monitor actually shows as % CPU or “percentage of CPU capability that’s being used” is what’s better known as active residency of each core, that’s the percentage of processor cycles that aren’t idle, but actively processing threads owned by a given process. But it doesn’t take into account the frequency or clock speed of the core at that time, nor the difference in core throughput between P and E cores.

The next time that you open Activity Monitor on an Apple silicon Mac, to look at % CPU figures, bear in mind that while those numbers aren’t completely meaningless, they don’t mean what you might think, and what’s shown as 100% could be anything between 27-100%.

Which M4 chip and model?

The Eclectic Light Company

hoakley

7 November 2024 at 15:30

In the light of recent news, you might now be wondering whether you can afford to wait until next year in the hope that Apple then releases the M4 Mac of your dreams. To help guide you in your decision-making, this article explains what chip options are available in this month’s new M4 models, and how to choose between them.

CPU core types

Intel CPUs in modern Macs have several cores, all of them identical. Whether your Mac is running a background task like indexing for Spotlight, or running code for a time-critical user task, code is run across any of the available cores. In an Apple silicon chip like those in the M4 family, background tasks are normally constrained to efficiency (E) cores, leaving the performance (P) cores for your apps and other pressing user tasks. This brings significant energy economy for background tasks, and keeps your Mac more responsive to your demands.

Some tasks are normally constrained to run only on E cores. These include scheduled background tasks like Spotlight indexing, Time Machine backups, and some encoding of media. Game Mode is perhaps a more surprising E core user, as explained below.

Most user tasks are run preferentially on P cores, when they’re available. When there are more high-priority threads to be run than there are available P cores, then macOS will normally send them to be run on E cores instead. This also applies to threads running a Virtual Machine (VM) using lightweight virtualisation, whose threads will be preferentially scheduled on P cores when they’re available, even when code being run in the VM would normally be allocated to E cores.

macOS also controls the clock speed or frequency of cores. For background tasks running on E cores, their frequency is normally held relatively low, for best energy efficiency. When high-priority threads overspill onto E cores, they’re normally run at higher frequency, which is less energy-efficient but brings their performance closer to that of a P core. macOS goes to great lengths to schedule threads and control core frequencies to strike the best balance between energy efficiency and performance.

Unfortunately, it’s normally hard to see effects of frequency in apps like Activity Monitor. Its CPU % figures only show the percentage of cycles that are used for processing, and make no allowance for core frequency. It will therefore show a background thread running at low frequency but 100%, the same as a thread overspilt from P cores running at the maximum frequency of that E core. So when you see Spotlight indexing apparently taking 200% of CPU % on your Mac’s E cores, that might only be a small fraction of their maximum capacity if they were running at maximum frequency.

There are no differences between chips in the M4 family when it comes to each type of CPU core: each P core in a Base variant is the same as each in an M4 Pro or Max, with the same maximum frequency, and the same applies to E cores. macOS also allocates threads to different types of core using the same rules, and their frequencies are controlled the same as well. What differs between them is the number of each type of core, ranging from 4 P and 4 E in the 8-core variant of the Base M4, up to 12 P and 4 E in the 16-core variant of the M4 Max. Thus, their single-core benchmark results should be almost identical, although their multi-core results should vary according to the number of cores.

Game Mode

This mode is an exception to normal CPU and GPU core use, as it:

gives preferential access to the E cores,
gives highest priority access to the GPU,
uses low-latency Bluetooth modes for input controllers and audio output.

However, my previous testing didn’t demonstrate that apps running in Game Mode were given exclusive access to E cores. But for gamers, it now appears that the more E cores, the better.

GPU cores

These are also used for tasks other than graphics, such as some of the more demanding calculations required for Machine Learning and AI. However, experience so far with Writing Tools in Sequoia 15.1 is that macOS currently offloads their heavy lifting to be run off-device in one of Apple’s dedicated servers. Although having plenty of GPU cores might well be valuable for non-graphics purposes in the future, for now there seems little advantage for many.

Thunderbolt 5

M4 Pro and Max, but not Base variants, come equipped with Thunderbolt ports that not only support Thunderbolt 3 and 4, but 5, as well as USB4. Thunderbolt 5 should effectively double the speed of connected TB5 SSDs, but to see that benefit, you’ll need to buy a TB5 SSD. Not only are they more expensive than TB3/4 models, but at present I know of only one range that’s due to ship this year. There will also be other peripherals with TB5 support, including at least one dock and one hub, although neither is available yet. The only TB5 accessories that are already available are cables, and even they are expensive.

TB5 also brings increased video bandwidth and support for DisplayPort 2.1, although even the M4 Max can’t make full use of that. If you’re looking to drive a combination of high-res displays, consult Apple’s Tech Specs carefully, as they’re complicated.

Although TB5 will become increasingly important over the next few years, TB3/4 and USB4 are far from dead yet and are supported by all M4 models.

Which M4 chip?

The table below summarises key figures for each of the variants in the M4 family that have now been released. It’s likely that next year Apple will release an Ultra, consisting of two M4 Max chips joined in tandem, in case you feel the burning desire for 24 P and 8 E cores.

m4configs2

Models available next week featuring each M4 chip are shown with green rectangles at the right.

There are two variants of the Base M4, one with 4P + 4E and 8 GPU cores, the same as Base variants in M1 to M3 families. There’s also the more capable variant, for the first time with 4P + 6E, which promises to be a better all-rounder, and when in Game Mode. It also has an extra couple of GPU cores.

The M4 Pro also comes in two variants, this time differing in the number of P cores, 8 or 10, and GPU cores, 16 or 20. Those overlap with the M4 Max, with 10 or 12 P cores and 32 or 40 GPU cores. Thus the gap between M4 Pro and Max isn’t as great as in the M3, with the GPUs in the M4 Max being aimed more at those working with high-res video, for instance. For more general use, there’s little difference between the 14-core Pro and Max.

Memory and storage

Chips in the M4 family also determine the maximum memory and internal SSD capacity. Apple has at last eliminated base models with only 8 GB of memory, and all now start with at least 16 GB. Base M4 chips are limited to a maximum of 32 GB, while the M4 Pro can go up to 64 GB, and the 16-core Max up to 128 GB, although in its 14-core variant, the Max is only available with 36 GB (I’m very grateful to Thomas for pointing this out below).

Unfortunately, Apple hasn’t increased the minimum size of internal SSD, which remains at 256 GB for some base models. Smaller SSDs may be cheaper, but they are also likely to have shorter lives, as under heavy use their small number of blocks will be erased for reuse more frequently. That may shorten their life expectancy to much less than the normal period of up to 10 years, as was seen in some of the first M1 models. This is more likely to occur when swap space is regularly used for virtual memory. I for one would have preferred 512 GB as a starting point.

While Base M4 chips come with SSDs up to 2 TB in size, both Pro and Max can be supplied with internal SSDs of up to 8 TB.

I hope this proves useful in guiding your decision.

Last Week on My Mac: M4 incoming

The Eclectic Light Company

hoakley

3 November 2024 at 16:00

Almost exactly a year after it released its first Macs featuring chips in the M3 family, Apple has replaced those with the first M4 models. Benchmarkers and core-counters are now busy trying to understand how these will change our Macs over the coming year or so. Before I reveal which model I have ordered, I’ll try to explain how these change the Mac landscape, concentrating primarily on CPU performance.

CPU cores

CPUs in the first two families, M1 and M2, came in two main designs, a Base variant with 4 Performance and 4 Efficiency cores, and a Pro/Max with 8 P and 2 or 4 E cores, that was doubled-up to make the Ultra something of a beast with its 16 P and 4 or 8 E cores. Last year Apple introduced three designs: the M3 Base has the same 4 P and 4 E CPU core configuration as in the M1 and M2 before it, but its Pro and Max variants are more distinct, with 6 P and 6 E in the Pro, and 10-12 P and 4 E cores in the Max. The M4 family changes this again, improving the Base and bringing the Pro and Max variants closer again.

As these are complicated by sub-variants and binned versions, I have brought the details together in a table.

mcorestable2024

I have set the core frequencies of the M4 in italics, as I have yet to confirm them, and there’s some confusion whether the maximum frequency of the P core is 4.3 or 4.4 GHz.

Each family of CPU cores has successively improved in-core performance, but the greatest changes are the result of increasing maximum core frequencies and core numbers. One crude but practical way to compare them is to total the maximum core frequencies in GHz for all the cores. Strictly speaking, this should take into account differences in processing units between P and E cores, but that also appears to have changed with each family, and is hard to compare. In the table, columns giving Σfn are therefore simply calculated as
(max P core frequency x P core count) + (max E core frequency x E core count)

Plotting those sum core frequencies by variant for each of the four families provides some interesting insights.

mcoresbars2024

Here, each bar represents the sum core frequency of each full-spec variant. Those are grouped by the variant type (Base, Pro, Max, Ultra), and within those in family order (M1 purple, M2 pale blue, M3 dark blue, M4 red). Many trends are obvious, from the relatively low performance expected of the M1 family, except the Ultra, and the changes between families, for example the marked differences in the M4 Pro, and the M3 Max, against their immediate predecessors.

Sum core frequencies fall into three classes: 20-30, 35-45, and greater than 55 GHz. Three of the four chips in the M1 family are in the lowest of those, with only the M1 Ultra reaching the highest. The M4 is the first Base variant to reach the middle class, thanks in part to its additional two E cores. Two of the M4 variants (Pro and Max) have already reached the highest class, and any M4 Ultra would reach far above the top of the chart at 128 GHz.

Real-world performance will inevitably differ, and vary according to benchmark and app used for comparison. Although single-core performance has improved steadily, apps that only run in a single thread and can’t take advantage of multiple cores are likely to show little if any difference between variants in each family.

Game Mode is also of interest for those considering the two versions of the M4 Base, with 4 or 6 E cores. This is because that mode dedicates the E cores, together with the GPU, to the game being played. It’s likely that games that are more CPU-bound will perform significantly better on the six E cores of the 10-Core version of the iMac, which also comes with a 10-core GPU and four Thunderbolt 4 ports.

Memory and GPU

Memory bandwidth is also important, although for most apps we should assume that Apple’s engineers match that with likely demand from CPU, GPU, neural engine, and other parts of the chip. There will always be some threads that are more memory-bound, whose performance will be more dependant on memory bandwidth than CPU or GPU cores.

Although Apple claims successive improvements in GPU performance, the range in GPU cores has started at 8 and attained 32-40 in Max chips. Where the Max variants come into their own is support for multiple high-res displays, and challenging video editing and processing.

Thunderbolt and USB 3

The other big difference in these Macs is support for the new Thunderbolt 5 standard, available only in models with M4 Pro or M4 Max chips; Base variants still only support Thunderbolt 4. Although there are currently almost no Thunderbolt 5 peripherals available apart from an abundant supply of expensive cables, by the end of this year there should be at least one range of SSDs and one dock shipping.

As ever with claimed Thunderbolt performance, figures given don’t tell the whole story. Although both TB4 and USB4 claim ‘up to’ 40 Gb/s transfer rates, in practice external SSD performance is significantly different, with Thunderbolt topping out at about 3 GB/s and USB4 reaching up to 3.4 GB/s. In practice, TB5 won’t deliver the whole of its claimed maximum of 120 Gb/s to a single storage device, and current reports are that will only achieve disk transfers at 6 GB/s, or twice TB4. However, in use that’s close to the expected performance of internal SSDs in Apple silicon Macs, and should make booting from a TB5 external SSD almost indistinguishable in terms of speed.

As far as external ports go, this widens the gap between the M4 Pro Mac mini’s three TB5 ports, which should now deliver 3.4 GB/s over USB4 or 6 GB/s over TB5, and its two USB-C ports that are still restricted to USB 3.2 Gen 2 at 10 Gb/s, equating to 1 GB/s, the same as in M1 models from four years ago.

My choice

With a couple of T2 Macs and a MacBook Pro M3 Pro, I’ve been looking to replace my original Mac Studio M1 Max. As it looks likely that an M4 version of the Studio won’t be announced until well into next year, I’m taking the opportunity to shrink its already modest size to that of a new Mac mini. What better choice than an M4 Pro with 10 P and 4 E cores and a 20-core GPU, and the optional 10 Gb Ethernet? I seldom use the fourth Thunderbolt port on the Studio, and have already ordered a Kensington dock to deliver three TB5 ports from one on the Mac, and I’m sure it will drive my Studio Display every bit as well as the Studio has done.

If you have also been tempted by one of the new Mac minis, I was astonished to discover that three-year AppleCare+ for it costs less than £100, that’s two-thirds of the price that I pay each year for AppleCare+ on my MacBook Pro.

I look forward to diving deep into both my new Mac and Thunderbolt 5 in the coming weeks.

CPU 和 GPU - 异构计算的演进与发展

面向信仰编程

9 April 2021 at 08:00