Inside M4 chips: CPU power, energy and mystery

By: hoakley
25 November 2024 at 15:30

Few comparisons or benchmarks for M-series chips take into account the reason for equipping Apple silicon chips with more than one CPU core type, following Arm’s big.LITTLE architecture. Measuring single- or multi-core performance ignores the purpose of E cores, and estimates of overall power use can’t compare the two core types. This article tries to estimate the cost, in terms of power and energy, of running identical tests on M4 P and E cores, and thereby provide insight into some of the most distinctive features of Apple silicon and their benefits.

Methods

To run these two in-core performance tests I use a GUI app wrapped around a series of loading tests designed to enable the CPU core to execute that code as fast as possible, and with as few extraneous influences as possible. Both tests used here are written in assembly code, and aren’t intended to be purposeful in any way, nor to represent anything that real-world code might run. Those are:

  • 64-bit floating point arithmetic, including an FMADD instruction to multiply and add, and FSUB, FDIV and FADD for subtraction, division and addition;
  • 32-bit 4-lane dot-product vector arithmetic (NEON), including FMUL, two FADDP and an FADD instruction.

Source code of the loops is given in the Appendix.

The GUI app sets the number of loops to be performed, and the number of threads to be run. Each set of loops is then put into the same Grand Central Dispatch queue for execution, at the maximum Quality of Service (QoS) of 33. That ensures they are run preferentially on P cores, but spill over to E cores when no P core is available, which here occurs when more than 10 threads are run concurrently. Timing of thread execution is performed using Mach Absolute Time, and the time for each thread to be executed is displayed at the end of the tests.
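
Below is a minimal sketch, not the author’s app, of the dispatch and timing just described: it queues identical threads at the highest QoS and times each with Mach Absolute Time. testLoop(_:) is a hypothetical stand-in for the assembly-language test kernel.

import Foundation

// Placeholder for the assembly test loop: any compute-bound work will do here.
func testLoop(_ n: Int) {
    var x = 1.0
    for _ in 0..<n { x = x * 1.000001 + 0.000001 }
    _ = x
}

// Queue 'count' threads at QoS 33 (userInteractive) and report each thread's time.
func runThreads(count: Int, loops: Int) {
    var info = mach_timebase_info_data_t()
    mach_timebase_info(&info)
    let ticksToSeconds = Double(info.numer) / Double(info.denom) / 1e9
    let queue = DispatchQueue.global(qos: .userInteractive)   // preferentially P cores
    let group = DispatchGroup()
    for i in 0..<count {
        queue.async(group: group) {
            let start = mach_absolute_time()
            testLoop(loops)
            let elapsed = Double(mach_absolute_time() - start) * ticksToSeconds
            print("thread \(i): \(elapsed) s")
        }
    }
    group.wait()   // block until every thread has completed
}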

For these tests, the total number of loops to be executed in each thread was set at 5 x 10^8 for floating point, and 3.5 x 10^9 for NEON. Those values were chosen to take 2-3 seconds per thread, to ensure the whole test period was available for analysis.

Immediately before running each test, I launch powermetrics from the command line, to gather core power and performance data in sampling periods of 0.1 second for a total of 50 samples. Its output is piped into a text file, which is then analysed using Numbers and DataGraph. All tests were conducted on a Mac mini M4 Pro with 10 P and 4 E cores, running macOS 15.1.1 in standard power mode.
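
With the sampling settings above, a typical powermetrics invocation (run as root; the output file name is only an example) would be:

sudo powermetrics --samplers cpu_power -i 100 -n 50 > m4test.txt

where -i sets the sampling period in milliseconds and -n the number of samples to collect.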

Each test was inspected individually, and seen to contain the following phases:

  1. small initial activity resulting from bringing the GUI app into focus, and clicking the Run button;
  2. a brief period of low activity, typically with total CPU power at below 50 mW;
  3. 1-2 sample periods when threads are loaded onto the cores;
  4. 15-21 sample periods when threads are run, whose total CPU power measurements are collected for analysis;
  5. 1-2 sample periods when threads are unloaded;
  6. a return to low activity, typically with total CPU power returning below 50 mW.

Means and standard deviations were then calculated for each series of power measurements, and pooled with times taken to execute threads.

Power used by thread

The first pair of graphs shows average power use for the number of threads run, shown here with error bars giving the range of +1 standard deviation. These show two sections: for 1-10 threads, when all were running on P cores, and for 11-14 threads, when the 10 P cores were fully committed and 1-4 threads spilt over to run on E cores at their maximum frequency. Maximum power used during testing was just short of 34 W.

m4powerflopt1

The graphs for the floating point test (above) and NEON (below) have regression lines fitted, indicating that:

  • Each additional floating point thread required 1,300 mW on P cores, and 110 mW on E cores.
  • Each additional NEON thread required 3,000 mW on P cores, and 280 mW on E cores.
  • P cores thus required 11-12 times the power of E cores, or E cores used 8-9% of the power of P cores.

m4powerneon1

Although linear regressions aren’t a bad fit, there’s consistent deviation from the linear relationship seen in previous analyses on M1 and M3 cores. More remarkably, the pattern of deviation is identical between these two tests, although they run in different units in these cores. In both cases, power use was high for 2 and 7 threads, while that for 3 and 8 threads was slightly lower. The only unusual pattern seen in powermetrics output was that, when running 2 and 7 threads, thread mobility was much higher than in other tests.

Previous tests on M1 and M3 P cores found that each additional floating point thread run on those requires about 935 mW, indicating a substantial increase in power used by M4 P cores when running at their higher maximum frequency. E cores in an M1 Pro require about 100 mW each when running at maximum frequency, similar to those in the M4.
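
Those per-thread power costs are simply the slopes of straight-line fits of total CPU power against thread count over the two segments (1-10 and 11-14 threads). As a minimal sketch in Swift, with illustrative names:

// Least-squares slope of y (total CPU power, W) against x (number of threads).
func slope(_ xs: [Double], _ ys: [Double]) -> Double {
    let n = Double(xs.count)
    let sx = xs.reduce(0, +)
    let sy = ys.reduce(0, +)
    let sxy = zip(xs, ys).reduce(0) { $0 + $1.0 * $1.1 }
    let sxx = xs.reduce(0) { $0 + $1 * $1 }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)
}
// Fitting the 1-10 thread segment gives about 1.3 W per additional floating point
// thread on P cores; the 11-14 thread segment gives about 0.11 W per thread on E cores.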

Execution time

As power is the rate of energy use over time, the next step is to examine total execution time for all the threads running concurrently, which should form a linear relationship with different gradients for P and E cores. The next two graphs demonstrate that.

m4powerflopt2

For both floating point (above) and NEON (below), there’s a tight linear relationship between total execution time and numbers of threads. Floating point demonstrates that each thread costs 2.4 seconds on P cores and 3.6 seconds on E cores, making E core execution time 150% that of P cores. NEON is similar, at 2.5 seconds on P cores and 3.4 seconds on E cores, for a ratio of 136%.

m4powerneon2

Time taken for the slowest thread to complete execution shows interesting finer detail.

m4powerflopt3

For both tests, performance falls into several sections according to the number of threads run. With fewer than 5 threads, there’s a sharp rise in the time taken per thread. From 5-10 threads, the time required remains constant, before increasing again above 10 threads, when additional threads are spilt over onto E cores.

This has implications for anyone trying to measure core performance, as it demonstrates that a single thread can run disproportionately fast compared with 3-10 threads. Basing any conclusion or comparison on a single thread could therefore be misleading: here a single thread completes in little more than 2 seconds, while each of 5 concurrent threads takes 2.34 seconds, 117% of the single-thread time.

m4powerneon3

Energy use

Although power use determines heat production, and so is an important factor in determining cooling requirements, the total energy required to execute threads is equally important for Macs running from battery. Simply reducing core frequency reduces the power used, but by extending the time taken to complete tasks it may have no effect on the energy used, or on battery endurance. My final two graphs therefore show estimated total energy used when running test threads on P and E cores, the ultimate test of any big.LITTLE CPU design such as that in the M4.

m4powerflopt4

Graphs for floating point (above) and NEON (below) are inevitably similar in form to those for power, with a near-linear section from 1-10 threads, when they are run only on P cores, and another from 11-14 threads, when they also spill over to E cores.

Fitted regression lines provide the energy cost for each additional thread:

  • For floating point, each thread run on a P core costs 3.1 J, and for an E core 1.5 J, making the energy used by an E core 47% that of a P core.
  • For NEON, P cores cost 7.7 J per thread, and E cores 3.0 J, making the energy used by an E core 38% that of a P core.

It’s important to remember that the E cores here aren’t being run at frequencies for high efficiency, but at their maximum so they can substitute for the P cores that are already in use.

m4powerneon4

Considering the small deviations from those linear relationships, it appears that running 2, 6 or 7 threads on P cores requires slightly more energy than predicted from the regression lines shown.

Unfortunately, assessing the energy used by E cores running at low frequencies, as they normally do when performing background tasks, is fraught with inaccuracies due to their low power use. My previous estimate for floating point tests is that a slow-running E core uses less than 45 mW per thread, and for the same task requires about 7% of the energy used by a P core running at maximum frequency, but I have lower confidence in the accuracy of those figures than in those above for higher frequencies.

Key information

  • When running the same code at maximum frequency, E cores used 8-9% of the power of P cores.
  • Power use when running 2 or 7 threads was anomalously high, possibly due to high thread mobility.
  • Execution on E cores was significantly slower than on P cores, at 136-150% of the time required on P cores.
  • Single-core performance measurements may not be accurate reflections of performance on multiple cores.
  • When running the same code at maximum frequency, energy used by an E core is expected to be 38-47% that of a P core.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance

Appendix: Source code

_fpfmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
FMOV D4, D0
FMOV D5, D1
FMOV D6, D2
LDR D7, INC_DOUBLE
fp_while_loop:
SUBS X4, X4, #1
B.EQ fp_while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
B fp_while_loop
fp_while_done:
FMOV D0, D4
LDR LR, [SP], #16
RET

_neondotprod:
STR LR, [SP, #-16]!
LDP Q2, Q3, [X0]
FADD V4.4S, V2.4S, V2.4S
MOV X4, X1
ADD X4, X4, #1
dp_while_loop:
SUBS X4, X4, #1
B.EQ dp_while_done
FMUL V1.4S, V2.4S, V3.4S
FADDP V0.4S, V1.4S, V1.4S
FADDP V0.4S, V0.4S, V0.4S
FADD V2.4S, V2.4S, V4.4S
B dp_while_loop
dp_while_done:
FMOV S0, S2
LDR LR, [SP], #16
RET

A brief history of Mac CPUs

By: hoakley
23 November 2024 at 16:00

Macs have used four different architectures for their Central Processing Units over the last 40 years. From their launch by Steve Jobs on 24 January 1984, for the first decade they used Motorola 68K CPUs, then switched to PowerPCs designed by an alliance of Apple, IBM and Motorola, which were used for 12 years. After 14 years built around Intel processors from 2006, Macs most recently changed a third time, to Apple’s own Arm-based chips.

Over those 40 years, continuous improvements in capabilities and performance of CPUs have transformed Mac OS and the apps it supports.

Motorola 68K

CPUs execute instructions in synchrony with a clock whose frequency determines the rate of instruction execution. The Motorola 68000 processor in the original Mac 128K ambled along at a clock speed of just 8 MHz. The last 68K models featuring 68040 CPUs had raised that to 33 MHz, and added specialist Memory Management Units (MMUs) and floating point units. The latter first appeared as 68881 and 68882 maths co-processors, but were later integrated into the 68040.

MMUs were particularly important for the implementation of virtual memory. When the Macintosh II was introduced in 1987, it was the first Mac that could be fitted with Motorola’s optional 68851 paged MMU, required for it to run Apple’s A/UX port of Unix with virtual memory support. Strangely, Apple’s own MMU fitted in the standard Mac II didn’t support virtual memory. Its 68020 CPU was also the first in Macs to use 32 bits rather than the 16 of the original 68000.

AIM PowerPC

When introduced in 1994, the first Power Macs came with PowerPC 601 or 601+ CPUs running at frequencies up to 110 MHz, nearly 14 times faster than the original Mac 128K. Just over a decade later, the last Power Mac G5 raised that to dual two-core CPUs at 2,500 MHz, more than 20 times the clock frequency, and the previous model had offered dual 2,700 MHz CPUs.

PowerPCs had their origins in IBM’s high-end POWER architecture, based on a reduced instruction set (RISC) intended to be run at higher frequencies. Initially, these CPUs used a 32-bit design, but progressed to 64 bits. Not only did they have integrated floating point units that were extended for Apple, but later models included AltiVec vector processing for single-precision floating point and integer operations.

High CPU frequencies bring higher power consumption and heat output. A dual-core G5 with a PowerPC 970MP CPU used a maximum of 100 W at 2,000 MHz, and some of the last G5 Macs used liquid cooling to cope with the heat generated at higher frequencies. Those didn’t prove long-lived, with coolant leaks a common and fatal failing.

Intel x86

In early 2006 Apple started releasing its new range of Macs using Intel CPUs. With the exception of a base model of Mac mini, those came with 2-core Core Duo processors running at up to 2 GHz, and were soon followed by the first Mac Pros featuring two 64-bit 2-core Xeon 5100 CPUs (Woodcrest) at up to 3 GHz. By the following year, the first 8-core Mac Pro was available.

Earlier increases in CPU frequency gradually petered out. The last Intel Mac Pro was available with cores running at 2.5-3.5 GHz, boosted to a maximum of up to 4.4 GHz. Instead of higher frequencies, high-end models offered as many as 28 cores, and drew up to 900 W of power. More typical of late desktop Macs were Intel Core i9 CPUs with 6-8 cores at similar frequencies. Adding more processor cores has been an effective way to run more code at the same time: tasks are divided into threads that can run relatively independently of one another, and those threads can then be distributed across several CPU cores.

Rising power consumption and heat output were becoming even more of a problem in MacBook Pro models.

Apple Arm

Well before Apple had joined IBM and Motorola in the AIM alliance, it had co-founded the company based in Cambridge, England, that was to become ARM (for Acorn RISC Machines). Its RISC processor was used in Apple’s Newton MessagePad of 1993, and in 2010 Apple released the first iPhone and iPad designed around its own Arm-based chip, the A4, with a single-core 32-bit CPU running at a cool and economical 1 GHz.

From before macOS Mojave in 2018, Apple was preparing for its next migration, to its own integrated Systems on a Chip (SoC), starting with the M1 in 2020. The first iPhone to incorporate two CPU core types was the iPhone 7 of 2016, in its A10 Fusion SoC. Rather than simply adding more cores, Apple had adopted the Arm big.LITTLE architecture, in which background threads are run on slower, more efficient CPU cores, and higher priority user processes run on faster, more performant cores.

The first two families of M1 and M2 chips have cores grouped in clusters of no more than 4, but Apple increased cluster size to 6 in the M3 and M4. While the M1 family consists of two designs, one for the base variant and a second shared by the Pro and Max (and doubled in the Ultra), the M3 and M4 families have distinct designs for their Pro and Max variants. For the M4, this offers a full range from 8 to 16 cores in total, with an anticipated Ultra extending to 32. In addition to CPU cores with built-in vector processing (NEON), these chips incorporate specialist co-processors such as a neural engine and a proprietary matrix co-processor, AMX.

Performance (‘big’) cores have increased in maximum frequency, from 3.2 GHz in the M1 to 4.5 GHz in the M4 four years later.

Trends

The period 1984-2007 was dominated by increasing CPU frequency, as demonstrated in the two charts below.

maccpuhistoryfreqlin

This chart uses a conventional linear Y axis to demonstrate that frequency rose rapidly during the decade from 1997. As the form of this curve is S-shaped, the chart below shows the same data with a logarithmic Y axis.

maccpuhistoryfreqexp

Since about 2007, Macs haven’t seen substantial frequency increases. Many factors limit the maximum frequency that a processor can run at, including its physical dimensions, but among the most significant in practical terms are its power requirements and heat output, hence its need for cooling. Thus, the period 2005-2017 became dominated by increasing core count.

maccpuhistorycores

This chart shows how the number of processors and cores inside Macs didn’t start rising until around 2005, just as frequencies were topping out. Thus, many of the CPU performance improvements from 2007 onwards have been the result of providing more cores. But there’s a practical limit as to how many of those cores will get used, which is where processing more data becomes important, as it has from 1998 onwards.

It’s remarkable how much of Mac OS has survived, if not flourished, over the 40 years in which our Macs have gone from a pedestrian Motorola 68000 processor to the 12 performance cores capable of 4.5 GHz in an M4 Max chip.

Arm’s annual technical conference wraps up, with its next-generation AI computing platform on the way

By: 莫崇宇
21 November 2024 at 18:54

This afternoon, the annual Arm Tech Symposia technical conference concluded in Shenzhen.

At the event, Arm explored the demands that AI places on computing, and explained how it aims to seize the opportunities of AI through its three core pillars of hardware, software and ecosystem, while attendees discussed Arm-based technical innovation and trends in AI.

James McNiven, VP of Product Management in Arm’s Client Line of Business, emphasised in his Shenzhen keynote that Armv9, Arm’s latest architecture, was designed from the outset to support AI computing and continues to be updated. Through key technologies such as SVE, SVE2 and SME, Arm uses architectural innovation and close hardware-software co-design to keep improving on-device AI, enabling developers to achieve strong AI performance.

One of the highlights of the conference was the KleidiAI software.

It is deeply integrated with mainstream AI frameworks, giving developers a smooth development experience; when paired with Arm CSS, KleidiAI combines Arm acceleration technologies such as Neon™, SVE2 and SME2 to deliver a marked improvement in compute performance.

KleidiAI is a set of high-performance compute kernels aimed at AI framework developers.

It helps developers easily obtain the best performance from Arm CPUs across a wide range of devices, making full use of key Arm architectural features such as Neon, SVE2 and SME2.

KleidiAI is also integrated with popular AI frameworks including PyTorch, TensorFlow and MediaPipe, optimises performance for models such as Meta Llama 3 and Phi-3, and is designed to be both forward- and backward-compatible.

The benefit is that KleidiAI should remain applicable to future market needs as Arm introduces further technologies.

According to Arm, integrating KleidiAI brings a significant improvement in generative AI efficiency.

Its figures show that, compared with a reference implementation (based on llama.cpp without the Kleidi software optimisations), the time-to-first-token of the Meta Llama 3 and Microsoft Phi-3 large language models (LLMs) running on llama.cpp with KleidiAI integrated is 190% faster on the new Arm Cortex-X925 CPU.

Another major advantage of KleidiAI is ease of integration.

Arm’s engineering team reportedly completed the Llama 3 performance-optimisation testing in less than 24 hours.

KleidiAI is also integrated with MediaPipe through XNNPACK, supporting the open-source Gemma LLM on mobile devices. As a result, the time-to-first-token of Gemma 2B on the Google Pixel 8 Pro smartphone was reduced by 25%.

Arm has also partnered with Unity on Sentis, an on-device AI inference engine that lets game developers build new AI-powered game experiences on any device supporting the Unity game engine.

In addition, as the fastest Arm compute platform to date, Arm CSS for Client delivers more than a 30% increase in compute and graphics performance, enough to handle demanding Android workloads of all kinds.

It also improves AI inference speed by 59%, for a broader range of AI/machine learning (ML) and computer-vision workloads.

At the heart of Arm CSS for Client is Arm’s most powerful, most efficient and most capable CPU cluster to date, aimed at the best balance of performance and energy efficiency.

With the new generation of Arm Cortex®-X CPUs, the AI-optimised Arm CSS for Client delivers its largest year-on-year IPC uplift, with performance up 36%; the new Arm Immortalis™ GPU improves graphics performance by 37%.

The Arm Immortalis-G925 GPU is Arm’s most powerful and most efficient GPU, delivering a 37% performance increase across a range of mobile games and a 34% improvement across several AI and ML networks.

The Immortalis-G925 targets the flagship smartphone market.

A new family of highly scalable GPUs, including the Arm Mali™-G725 and Mali-G625, targets a broad range of consumer devices, from premium phones to smartwatches and XR wearables.

Arm expects there to be more than 100 billion AI-capable Arm devices worldwide by the end of 2025.

From sensors and smartphones to the industrial IoT, cars and data centres, the growth of AI, like the construction of a skyscraper, depends on solid foundations: powerful and efficient computing platforms.

Through sustained innovation in chip architecture and technology, Arm is building the most reliable foundations for this ‘AI skyscraper’, and will play an increasingly important role in this technological transformation.

Inside M4 chips: CPU core performance

By: hoakley
20 November 2024 at 15:30

There’s no doubt that the CPUs in M4 chips outperform their predecessors. General-purpose benchmarks such as Geekbench demonstrate impressive rises in both single- and multi-core results, in my experience from 3,191 (M3 Pro) to 3,892 (M4 Pro), and 15,607 (M3 Pro) to 22,706 (M4 Pro). But the latter owes much to the increase in Performance (P) core count from 6 to 10. In this series I concentrate on much narrower concepts of performance in CPU cores, to provide deeper insight into topics such as core types and energy efficiency. This article examines the in-core performance of P and E cores, and how they differ.

P core frequencies have increased substantially since the M1. If we set that as 100%, M3 P cores run at around 112-126% of that frequency, and those in the M4 at 140%.

E cores are more complex, as they have at least two commonly used frequencies, that when running low Quality of Service (QoS) threads, and that when running high QoS threads that have spilt over from P cores. Low QoS threads are run at 77% of M1 frequency when on an M3, and 105% on an M4. High QoS threads are normally run at higher frequencies of 133% on the M3 E cores (relative to the M1 at 100%), but only 126% on the M4.

Methods

To measure in-core performance I use a GUI app wrapped around a series of loading tests designed to enable the CPU core to execute that code as fast as possible, and with as few extraneous influences as possible. Of the seven tests reported here, three are written in assembly code, and the others call optimised functions in Apple’s Accelerate library from a minimal Swift wrapper. These tests aren’t intended to be purposeful in any way, nor to represent anything that real-world code might run, but simply provide the core with the opportunity to demonstrate how fast it can be run at a given frequency.

The seven tests used here are:

  • 64-bit integer arithmetic, including a MADD instruction to multiply and add, a SUBS to subtract, an SDIV to divide, and an ADD;
  • 64-bit floating point arithmetic, including an FMADD instruction to multiply and add, and FSUB, FDIV and FADD for subtraction, division and addition;
  • 32-bit 4-lane dot-product vector arithmetic (NEON), including FMUL, two FADDP and a FADD instruction;
  • simd_float4 calculation of the dot-product using simd_dot in the Accelerate library;
  • vDSP_mmul, a function from the vDSP sub-library in Accelerate, which multiplies two 16 x 16 32-bit floating point matrices, and in M1 and M3 chips appears to use the AMX co-processor;
  • SparseMultiply, a function from Accelerate’s Sparse Solvers, which multiplies a sparse and a dense matrix, and may use the AMX co-processor in M1 and M3 chips;
  • BNNSMatMul matrix multiplication of 32-bit floating-point numbers, also in the Accelerate library, and since deprecated.

Source code of the loops is given in the Appendix.

The GUI app sets the number of loops to be performed, and the number of threads to be run. Each set of loops is then put into the same Grand Central Dispatch queue for execution, at a set Quality of Service (QoS). Timing of thread execution is performed using Mach Absolute Time, and the time for each thread to be executed is displayed at the end of the tests.

I normally run tests at either the minimum QoS of 9, or the maximum of 33. The former are constrained by macOS to be run only on E cores, while the latter are run preferentially on P cores, but may spill over to E cores when no P core is available. All tests are run with a minimum of other activities on that Mac, although it’s not unusual to see small amounts of background activity on the E cores during test runs.

The number of loops completed per second is calculated for two thread totals for each of the three execution contexts. Those are:

  • P cores alone, based on threads run at high QoS on 1 and 10 P cores;
  • E cores at high frequency (‘fast’), run at high QoS on 10 cores (no threads on E cores) and 14 (4 threads on E cores);
  • E cores at low frequency (‘slow’), run at low QoS on 1 and 4 E cores.

Results are then corrected by removing overhead estimated as the rate of running empty loops. Finally, each test is expressed as a percentage of the performance achieved by the P cores in an M1 chip. Thus, a loop rate double that achieved by running the same test on an M1 P core is given as 200%.

P core performance

As in subsequent sections, these are shown in the bar chart below, in which the pale blue bars are for M1 P cores, dark blue bars for M3 P cores, and red for M4 P cores.

m134coreperf1

As I indicated in my preview of in-core performance, there is little difference in integer performance between M3 and M4 P cores, but a significant increase in floating point, which matches that expected from the increased frequency of the M4.

Vector performance in the NEON and simd dot tests, and matrix multiplication in vDSP mmul rise higher than would be expected by frequency differences alone, to over 160% of M1 performance. The latter two tests are executed using Accelerate library calls, so there’s no guarantee that they are executed the same on different chips, but the first of those is assembly code using NEON instructions. SparseMultiply and BNNS matmul are also Accelerate functions whose execution may differ, and don’t fare quite as well on the M4.

E core slow performance

m134coreperf2

On E cores, threads run at low QoS execute at frequencies close to idle, as reflected in their performance, which is still shown relative to an M1 P core running at much higher frequency. Frequency differences account for the relatively poor performance of M3 E cores, and the improvement in results for the M4. Those improvements are disproportionate in the vector and matrix tests, which could be accounted for by the M4 E core running those at higher frequencies than would be normal for low QoS threads.

Although the best of these, NEON, is still well below M1 P core performance (73%), this suggests a design decision to deliver faster vector processing on M4 E cores, which is interesting.

E core fast performance

m134coreperf3

Ideally, when high QoS threads overspill from P cores, it’s preferable that they’re executed as fast as they would have been on a P core. Those in the M1 fall far short of that, in scalar and vector tests only delivering 40-60% of a P core, although that seemed impressive at the time. The M3 does considerably better, with vector and one matrix test slightly exceeding the M1 P core, and the M4 is even faster in vector calculations, peaking at over 130% for NEON assembly code.

Far from being a cut-down version of its P core, the M4 E core can now deliver impressive vector performance when run up to maximum frequency.

M4 P and E comparison

Having considered how P and E cores have improved against those in the M1, it’s important to look at the range of computing capacity they provide in the M4. This is shown in the chart below, where pale blue bars are P cores, red bars E cores at high QoS and frequency, and dark blue bars E cores at low QoS and frequency. Again, these are all shown relative to P core performance in the M1.

m134coreperf4

Apart from the integer test, scalar floating point, vector and matrix calculations on P cores range between 140-175% of those of the M1, a significant increase over that expected from the frequency increase alone. Scalar and vector (but not matrix) calculations on E cores at high frequency are slower, although in most situations that shouldn’t be too noticeable. Performance does drop off for E cores at low frequency, though, and that would clearly have an impact on code run at low QoS.

Given the range of operating frequencies, P and E cores in the M4 chip deliver a wide range of performance at different power levels, and it’s power that I’ll examine in the next article in this series.

Key information

  • M4 P core maximum frequency is 140% that of the M1. That increase in frequency accounts for much of the improved P core performance seen in M4 chips.
  • E core frequency changes are more complex, and some have reduced rather than risen compared with the M1.
  • P core floating point performance in the M4 has increased as would be expected from its frequency change, and vector and matrix performance has increased more, to over 160% of that of the M1.
  • E core performance at low QoS and frequency has improved in comparison to the M3, and most markedly in vector and matrix tests, suggesting design improvements in the latter.
  • E core performance at high QoS and frequency has also improved, again most prominently in vector tests.
  • Across their frequency ranges, M4 P and E cores now deliver a wide range of performance and power use.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores

Appendix: Source code

_intmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
int_while_loop:
SUBS X4, X4, #1
B.EQ int_while_done
MADD X0, X1, X2, X3
SUBS X0, X0, X3
SDIV X1, X0, X2
ADD X1, X1, #1
B int_while_loop
int_while_done:
MOV X0, X1
LDR LR, [SP], #16
RET

_fpfmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
FMOV D4, D0
FMOV D5, D1
FMOV D6, D2
LDR D7, INC_DOUBLE
fp_while_loop:
SUBS X4, X4, #1
B.EQ fp_while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
B fp_while_loop
fp_while_done:
FMOV D0, D4
LDR LR, [SP], #16
RET

_neondotprod:
STR LR, [SP, #-16]!
LDP Q2, Q3, [X0]
FADD V4.4S, V2.4S, V2.4S
MOV X4, X1
ADD X4, X4, #1
dp_while_loop:
SUBS X4, X4, #1
B.EQ dp_while_done
FMUL V1.4S, V2.4S, V3.4S
FADDP V0.4S, V1.4S, V1.4S
FADDP V0.4S, V0.4S, V0.4S
FADD V2.4S, V2.4S, V4.4S
B dp_while_loop
dp_while_done:
FMOV S0, S2
LDR LR, [SP], #16
RET

func runAccTest(theA: Float, theB: Float, theReps: Int) -> Float {
var tempA: Float = theA
var vA = simd_float4(theA, theA, theA, theA)
let vB = simd_float4(theB, theB, theB, theB)
let vC = vA + vA
for _ in 1...theReps {
tempA += simd_dot(vA, vB)
vA = vA + vC
}
return tempA
}

16 x 16 32-bit floating point matrix multiplication

var theCount: Float = 0.0
let A = [Float](repeating: 1.234, count: 256)
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)
let IC: vDSP_Stride = 1
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
B.withUnsafeBufferPointer { Bptr in
C.withUnsafeMutableBufferPointer { Cptr in
for _ in 1...theReps {
vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB, Cptr.baseAddress!, IC, M, N, P)
theCount += 1
} } } }
return theCount

Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision.” “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”

Sparse matrix multiplication

var theCount: Float = 0.0
let rowCount = Int32(4)
let columnCount = Int32(4)
let blockCount = 4
let blockSize = UInt8(1)
let rowIndices: [Int32] = [0, 3, 0, 3]
let columnIndices: [Int32] = [0, 0, 3, 3]
let data: [Float] = [1.0, 4.0, 13.0, 16.0]
let A = SparseConvertFromCoordinate(rowCount, columnCount, blockCount, blockSize, SparseAttributes_t(), rowIndices, columnIndices, data)
defer { SparseCleanup(A) }
var xValues: [Float] = [10.0, -1.0, -1.0, 10.0, 100.0, -1.0, -1.0, 100.0]
let yValues = [Float](unsafeUninitializedCapacity: xValues.count) {
resultBuffer, count in
xValues.withUnsafeMutableBufferPointer { denseMatrixPtr in
let X = DenseMatrix_Float(rowCount: 4, columnCount: 2, columnStride: 4, attributes: SparseAttributes_t(), data: denseMatrixPtr.baseAddress!)
let Y = DenseMatrix_Float(rowCount: 4, columnCount: 2, columnStride: 4, attributes: SparseAttributes_t(), data: resultBuffer.baseAddress!)
for _ in 1...theReps {
SparseMultiply(A, X, Y)
theCount += 1
} }
count = xValues.count
}
return theCount

Apple describes SparseMultiply() as performing “the multiply operation Y = AX on a sparse matrix of single-precision, floating-point values.” “Use this function to multiply a sparse matrix by a dense matrix.”

BNNS matrix multiplication

var theCount: Float = 0.0
let inputAValues: [Float] = [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 ]
let inputBValues: [Float] = [
1, 2,
3, 4,
1, 2,
3, 4,
1, 2,
3, 4]
var inputADescriptor = BNNSNDArrayDescriptor.allocate(
initializingFrom: inputAValues,
shape: .imageCHW(3, 4, 2))
var inputBDescriptor = BNNSNDArrayDescriptor.allocate(
initializingFrom: inputBValues,
shape: .tensor3DFirstMajor(3, 2, 2))
var outputDescriptor = BNNSNDArrayDescriptor.allocateUninitialized(
scalarType: Float.self,
shape: .imageCHW(inputADescriptor.shape.size.0,
inputADescriptor.shape.size.1,
inputBDescriptor.shape.size.1))
for _ in 1...theReps {
BNNSMatMul(false, false, 1,
&inputADescriptor, &inputBDescriptor, &outputDescriptor, nil, nil)
theCount += 1
}
inputADescriptor.deallocate()
inputBDescriptor.deallocate()
outputDescriptor.deallocate()
return theCount

Inside M4 chips: E and P cores

By: hoakley
18 November 2024 at 15:30

In the two previous articles (links at the end), I explored some of the features and properties of Performance (P) cores in Apple’s latest M4 chips. This article looks at their Efficiency (E) cores by comparison.

M4 family

In the three current M4 designs, there are only two variations in terms of E cores:

  • Base M4, with 6 E cores, except for a cheaper variant with only 4 active E cores.
  • M4 Pro and Max, with 4 E cores, including ‘binned’ variants.

Apple is expected to release an Ultra variant in 2025, with two M4 Max chips in tandem, providing a total of 8 E cores. Apart from the number of cores, all E cores are the same, and different from P cores.

E core architecture

All E cores are arranged in a single cluster of 4 or 6, sharing common L2 cache, and running at the same frequency (clock speed). Analysis of M1 cores implies that, for those processing units where a P core has more than one, an E core has roughly half that number, giving an M1 E core roughly half the compute capacity of a P core. I haven’t seen any comparable analysis of cores in later M families, although differences in power consumption imply there remain substantial differences in processing units and compute capacity.

Frequency

Like P cores, E cores can be set to run at any of 5 values between the minimum of 1,020 MHz and maximum of 2,592 MHz (1.0-2.6 GHz). When running macOS, cluster frequency is set by macOS at a kernel level; other operating systems may offer more direct control. This range of frequencies is significantly narrower than that of E cores in the M3, which range between 744-2,748 MHz.

E cores idle at 1,020 MHz, and although they can be shut down altogether, that’s exceptional given the steady demand for macOS background threads to be run on them. Nevertheless, powermetrics still reports their ‘down’ residencies separately from idle residencies.

Instruction set

This is believed to be identical to that of the M4 P cores, ARMv9.2-A without its Scalable Vector Extension (SVE), enabling the same threads to be run on either core type.

Single thread comparisons

One way to appreciate the contrasts between core types is to compare a single intensive in-core thread run in each. For this purpose, I used a tight loop of floating point calculations, running at two different Quality of Service (QoS) settings, in macOS 15.1.

Single thread at high QoS on P cores

m4singlePflopt1

This thread was initially loaded onto P13 (red) in the second (P1) cluster, and after 3.7 seconds was moved to P5 (blue) in the first (P0) cluster. After a further 4.6 seconds running on that, it was moved back to the second (P1) cluster, to run on P11 (purple). During this run, there was almost no other activity on the two P clusters, and the inactive cluster was therefore shut down while this thread was running on the other.

m4singlePflopt2

The active cluster was run at the maximum frequency of 4,511 MHz throughout. Just before the thread was moved to a different cluster, that was brought up and run up to maximum frequency ready to run the thread.

m4singlePflopt3

Total CPU power remained similar throughout the period the thread was being executed, but there is a small and consistent difference according to which cluster was active: the first (P0) brought power use to about 2,520 mW, 50 mW higher than the second (P1) at about 2,470 mW. This matches the difference reported previously, and merits assessment in other M4 Pro chips to determine whether it is a general feature.

Single thread at high QoS on E cores

There are two ways of running code such as the in-core floating point loop test used here on E cores: threads can be run at low QoS (Background), so that macOS allocates them only to E cores, or high QoS threads can spill over onto them when there are more threads than available P cores. On an M4 Pro chip, the latter requires 11 threads, which results in one of them being allocated to the E cluster, as described next.

m4singlePonEflop1

This chart shows active residency on the four E cores with a single high QoS thread spilt onto them. While cores E1, E2 and E3 appear to handle other threads over this period of more than six seconds, core E0 appears to run at 90-100% active residency executing the spilt thread, and that thread wasn’t moved between cores at any point.

E cluster frequency remained constant throughout at its maximum of 2,592 MHz. CPU power use was inevitably dominated by the ten P cores running at 100% active residency and maximum frequency, remaining at just under 14,000 mW. Unfortunately, using powermetrics it’s not possible to estimate the power use of the E cluster directly.

Single thread at low QoS on E cores

This is very different from the spilt thread at high QoS.

m4singleEflop1

There’s no evidence here that any single core in the E cluster ran a thread at 100% active residency. Instead the thread appears to have been moved rapidly and freely around the cores, with many 0.1 second sampling intervals spanning its execution on more than one core over that period.

m4singleEflop2

Cluster frequency remained steady at 1,050-1,060 MHz, close to the minimum, with superimposed spikes when it rose briefly to the maximum of 2,592 MHz. This suggests that the single thread would most probably have been run at close to core minimum frequency, had there not been additional threads to run.

m4singleEflop3

A similar picture is seen in power use, with spikes from a low background of about 40-45 mW required by the single thread alone.

Single thread behaviours

These can be summarised as:

  • P core (high QoS) runs at 100% active residency on a single P core at maximum frequency, and is switched between clusters irregularly (about every 3.7-4.6 seconds). Total power use is about 2,500 mW.
  • High QoS spilt over to E cores runs at 90-100% active residency on a single E core at maximum frequency, and is either not switched between cores at all, or only infrequently.
  • E core (low QoS) runs at a total of about 100% active residency, moved frequently between all E cores in the cluster, at close to minimum frequency. Total power use is about 40-45 mW.

Performance, power and efficiency

Although I’ll be returning to more detailed comparisons of performance and power use between P and E cores, I provide a single illustration here, for the in-core floating point task used above.

Running 2 x 10^9 loops in each thread, P cores at maximum frequency take 9.2-9.7 seconds per thread, and use about 2,500 mW per thread. E cores running low QoS threads at close to minimum frequency take about four times as long, 38.5 seconds, but use less than 45 mW of power per thread. The total energy used to complete one thread is therefore over 23 J when run on P cores, and about 1.7 J when run on E cores. E cores therefore use only about 7% of the energy that P cores do to perform the same task.
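
As a rough check of those figures, energy is just power multiplied by time, using the values quoted above:

// Energy (joules) = power (watts) × time (seconds).
let pCoreEnergy = 2.5 * 9.45          // ≈ 23.6 J per thread on a P core (2,500 mW × ~9.45 s)
let eCoreEnergy = 0.045 * 38.5        // ≈ 1.7 J per thread on an E core (45 mW × 38.5 s)
let ratio = eCoreEnergy / pCoreEnergy // ≈ 0.07, so E cores use about 7% of the energy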

Key information

  • Current M4 chips feature 4-6 CPU E cores.
  • M4 E cores are arranged in a single cluster of 4 or 6, sharing L2 cache and running at a common frequency.
  • The E core cluster can be shut down (exceptionally), left idling at its minimum frequency of 1,020 MHz, or run at one of 6 set frequencies up to its maximum of 2,592 MHz, as controlled by macOS.
  • Their instruction set is the same as M4 P cores, ARMv9.2-A without its Scalable Vector Extension (SVE).
  • They use 40-45 mW when at low frequencies, but it’s not currently feasible to measure directly their maximum power use at high frequencies.
  • macOS allocates threads to E cores when their QoS is 9 (Background), and when a thread with higher QoS can’t be allocated to a P core because they are all busy. Management of frequencies and core allocation differ between those two cases.
  • High QoS threads on E cores are run at maximum frequency and appear not to move between cores.
  • Low QoS threads on E cores are run at close to minimum frequency and are highly mobile between cores.
  • Low QoS threads running on E cores run more slowly than higher QoS threads running on P cores, but E core power use is much lower, resulting in considerable saving in total energy use for the same computational task.

Previous article

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Inside M4 chips: P cores hosting a VM

By: hoakley
13 November 2024 at 15:30

One common but atypical situation for any M-series chip is running a macOS virtual machine. This article explores how virtual CPU cores are handled on physical cores of an M4 Pro host, and provides further insight into their management, and thread mobility across P core clusters.

Unless otherwise stated, all results here are obtained from a macOS 15.1 Sequoia VM in my free virtualiser Viable, with that VM allocated 5 virtual cores and 16 GB of memory, on a Mac mini M4 Pro running macOS Sequoia 15.1 with 48 GB of memory, 10 P and 4 E cores.

How virtual cores are allocated

All virtualised threads are treated by the host as if they are running at high Quality of Service (QoS), so are preferentially allocated to P cores, even though their original thread may be running at the lowest QoS. This has the side-effect of running virtual background processes considerably quicker than real background threads on the host.

In this case, the VM was given 5 virtual cores so they could all run in a single P cluster on the host. That doesn’t assign 5 physical cores to the VM, but runs VM threads up to a total of 500% active residency across all the P cores in the host. If the VM is assigned more virtual cores than are available in the host’s P cores, then some of its threads will spill over and be run on host E cores, but at the high frequency typical of host threads with high QoS.
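
For illustration only, this is roughly how a virtualiser built on Apple’s Virtualization framework sizes such a guest; a complete configuration also needs a platform, boot loader, storage and other devices, omitted here:

import Virtualization

let config = VZVirtualMachineConfiguration()
config.cpuCount = 5                              // 5 virtual cores, as in the VM used here
config.memorySize = 16 * 1024 * 1024 * 1024      // 16 GB
// macOS schedules the guest's vCPU threads like any other high-QoS host threads, so their
// combined load appears as up to 500% active residency spread across the host's P cores.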

Performance

There is a slight performance hit in virtualisation, but it’s surprisingly small compared to other virtualisers. Geekbench 6.3.0 benchmarks for guest and host were:

  • CPU single core VM 3,643, host 3,892
  • CPU multi-core VM 12,454, host 22,706
  • GPU Metal VM 102,282, host 110,960, with the VM as an Apple Paravirtual device.

Some tests are even closer: using my in-core floating point test, 1,000 Mloops run in the VM in 4.7 seconds, and in the host in 4.68 seconds.

Host core allocation

To assess P core allocation on the host, an in-core floating point test was run in the VM. This consisted of 5 threads with sufficient loops to fully occupy the virtual cores for about 20 seconds. In the following charts, I show results from just the first 15 seconds, as representative of the whole.

vmpcoresm4pro0

When viewed by cluster, those threads were mainly loaded first onto the first P cluster (pale blue bars), where they ran for just over 1 second before being moved to the second cluster (red bars). They were then regularly switched between the two P clusters every few seconds throughout the test. Four cycles were completed in this section of the results, with each taking 2.825 seconds, so threads were switched between clusters every 1.4 seconds, the same time as I found when running threads on the host alone, as reported previously.

For most of the 15 seconds shown here, total active residency across both P clusters was pegged at 500%, as allocated to the VM in its 5 virtual cores, with small bursts exceeding that. Thus that 500% represents those virtual cores, and the small bursts are threads from the host. Although the great majority of that 500% was run on the active P cluster, a total of about 30% active residency consisted of other threads from the VM, and ran on the less active P cluster. That probably represents the VM’s macOS background processes and overhead from its folder sharing, networking, and other Virtio device use.

vmpcoresm4pro1

When broken down to individual cores within each cluster, shown for the first cluster above and the second below, total activity differs little across the cores in the active cluster. During its period as the active cluster, each core had an active residency of 80-100%, bringing the cluster total to about 450% while most active.

vmpcoresm4pro2

In case you’re wondering whether this also occurs on older Apple silicon, or whether it’s just a feature of macOS Sequoia, here’s a similar example of a 4-core VM running 3 floating point threads on an M1 Max with macOS 15.1, seen in Activity Monitor’s CPU History window. There’s no movement of threads between clusters.

vmm1maxtest1

P core frequencies

powermetrics, used to obtain this data, provides two types of core frequency information. For each cluster it gives a hardware active frequency, then for each core it gives an individual frequency, which often differs within each cluster. Cores in the active P cluster were typically reported as running at a frequency of 4,512 MHz, although the cluster frequency was lower, at about 3,858 MHz. For simplicity, cluster frequencies are used here.

vmpcoresm4pro3

This chart shows reported frequencies for the two P clusters in the upper lines. Below them are total cluster active residencies to show which cluster was active during each period.

The active cluster had a steady frequency of just below 3,900 MHz, but when it became the less active one, its frequency varied greatly, from idle at 1,260 MHz up to almost 4,400 MHz, often for very brief periods. This is consistent with the active cluster running the intensive in-core test threads, and the other cluster handling other threads from both the VM and host.

CPU power

Several who have run VMs on notebooks report that they appear to drain the battery quickly. Using the previous results from the host, the floating point test used here would be expected to use a steady 7,000 mW.

vmpcoresm4pro4

This last chart shows the total CPU power use in mW over the same period, again with cluster active residency (here multiplied by 10), added to aid recognition of cluster cycles. This appears to average about 7,500 mW, only 500 mW more than expected when run on the host alone. That shouldn’t result in a noticeable increase in power usage in a notebook.

In the previous article, I remarked on how power used appeared to differ between the two clusters, and this is also reflected in these results. When the second cluster (P1) is active, power use is lower, at about 7,100 mW, and it’s higher, at about 7,700 mW, when the first cluster (P0) is active. This needs to be confirmed on other M4 Pro chips before it can be interpreted.

Key information

  • macOS guests perform almost as well as the M4 Pro host, although multi-core benchmarks are proportionate to the number of virtual cores allocated to them. In particular, Metal GPU performance is excellent.
  • All threads in a VM are run as if at high QoS, thus preferentially on host P cores. This accelerates low QoS background threads running in the VM.
  • Virtual core allocation includes all overhead from the VM, such as its macOS background threads.
  • Guest threads are as mobile as those of the host, and are moved between P clusters every 1.4 seconds.
  • Although threads run in a VM incur a small penalty in additional power use, this shouldn’t be significant for most purposes.
  • Once again, evidence suggests that the first P cluster (P0) in an M4 Pro uses slightly more power than the second (P1). This needs to be confirmed in other systems.
  • powermetrics can’t be used in a VM, unsurprisingly.

Previous article

Inside M4 chips: P cores

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Inside M4 chips: P cores

By: hoakley
11 November 2024 at 15:30

This is the first in a series diving deeper into Apple’s new M4 family of chips. This starts with details of its Performance (P) cores. Comparisons of their performance against cores in earlier M-series chips will follow separately when I have completed them.

M4 family

There are currently three M4 designs:

  • Base M4, with 4 P and 6 E cores, also available in a cheaper variant with only 4 active E cores, and a ‘binned’ variant for iPads with only 3 active P cores.
  • M4 Pro, with 10 P and 4 E cores, also available in a ‘binned’ variant with only 8 active P cores.
  • M4 Max, with 12 P and 4 E cores, also available in a ‘binned’ variant with only 10 active P cores.

Apple is expected to release an Ultra variant in 2025, consisting of two M4 Max chips connected and working in tandem, providing a total of 24 P and 8 E cores.

Apart from the number of cores in each design, their caches and memory, all P cores are the same, and different from E cores.

P core architecture

All CPU cores are arranged in clusters of up to 6. All cores within any given cluster share L2 cache, and are run at the same frequency (clock speed). The Base M4 has a single cluster of 4 P cores, while the Pro and Max have two clusters of 5 and 6 cores respectively.

Frequency

A prominent feature of both P and E cores is their variable frequency (clock speed). In the case of P cores, this can be set to any of 17 values between the minimum of 1,260 MHz and maximum of 4,512 MHz (1.3-4.5 GHz). When running macOS, cluster frequencies are set by macOS at a kernel level; other operating systems may offer more direct control.

P cores idle at 1,260 MHz, but can also be shut down altogether. Previous M-series chips have been reported by the powermetrics command tool as sometimes being idle at a frequency of 0 MHz, but the M4 is the first to have idle and down states reported separately, for example:
CPU 4 active residency: 0.00%
CPU 4 idle residency: 0.00%
CPU 4 down residency: 100.00%

when that core and its whole cluster are shut down rather than just idling. It’s not clear whether this is merely an administrative change, or whether M4 cores implement this state differently from previous cores.

Instruction set

There’s confusion over the Instruction Set Architecture (ISA) supported by M4 cores. This is explained in the LLVM source, where it’s claimed that they’re “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE). Some might consider that’s closer to ARMv8.7-A, one version more recent than the M3’s ARMv8.6-A.

Although this is now fully supported in LLVM clang, it’s not clear how fully it’s supported by Xcode, for example.

Power

When shut down, a P core consumes no power, of course, and at idle with no active residency, it uses only 1-2 mW, according to measurements reported by powermetrics.

Maximum power consumption rises to approximately 1,400 mW when running intensive floating point calculations at 100% active residency, and to approximately 3,230 mW when running NEON vector instructions at 100% active residency.

macOS core allocation

Threads are normally allocated by macOS to an available P core when their designated Quality of Service (QoS) is higher than 9 (Background), for example when using Dispatch, formerly branded Grand Central Dispatch (GCD). Running threads may also be moved periodically between P cores in the same cluster, and between clusters. Previous M-series chips appear to move threads less frequently, sometimes leaving them to run on the same core for several seconds through to completion, but threads appear to be considerably more mobile when running on M4 P cores.
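
As an illustrative sketch of that allocation rule (busyWork() here is a hypothetical stand-in for any compute-bound task, not code from these tests):

import Foundation

func busyWork() {
    var x = 0.0
    for i in 1...10_000_000 { x += Double(i).squareRoot() }
    _ = x
}

// QoS 9 (Background): eligible to run only on E cores.
DispatchQueue.global(qos: .background).async { busyWork() }

// QoS 33 (userInteractive): preferentially run on P cores, spilling to E cores
// only when every P core is already busy.
DispatchQueue.global(qos: .userInteractive).async { busyWork() }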

VM4core4threadPcoresARes

This bar chart shows 4 threads from 4 virtual CPUs in a VM running for 3 seconds at 100% active residency. For almost all that period, the threads remain running on the 4 physical cores of the first P cluster in this M1 Max, with the second P cluster remaining idle for much of that time.

The following charts show 4 threads of intensive in-core floating point arithmetic running on the P cores of an M4 Pro.

m4threads1clusters

When viewed by cluster, those threads are loaded first onto the second P cluster (red bars), where they run for 0.4 seconds before being moved to the first cluster (pale blue bars). After running there for 1.3 seconds, they’re moved back to the second cluster for a further 1.3 seconds, before completing on the first cluster.

The next two bar charts show each cluster separately, illustrating thread mobility within them.

m4threads2cluster1

When running on the first cluster (above), threads appear to be moved to a different core approximately every 0.3 second, as they do when on the second cluster (below).

m4threads3cluster2

m4threads4frequency

Cluster frequency matches this movement, with each cluster being run up to maximum frequency (shown here averaged across the whole cluster) to process the threads running on its cores. The black line below those for the P clusters shows the small changes in average frequency for the E cluster over this period.

m4threads5power

This last chart shows the total CPU power use in mW over the same period. Of particular interest here is the consistent difference in power use reported by powermetrics between the two P clusters: the first (P0) used a steady 6,000 mW when running these four threads, whereas the second (P1) used slightly less, at 5,700-5,800 mW. That could be the result of measurement error in powermetrics, peculiar to this particular chip, or could reflect an underlying difference between the two clusters.

Thread mobility makes interpreting CPU History in Activity Monitor difficult, as the fastest frequency of sampling available there is every second, while powermetrics was sampling every 0.1 second when gathering the data above. As groups of threads may be moved between clusters every 1.3 seconds or so, this can give the impression that threads are being run on both clusters simultaneously. Once again, great care is needed when interpreting the data shown by Activity Monitor.

Key information

  • Current M4 chips offer 4-12 CPU P cores.
  • M4 P cores are arranged in clusters of up to 6, sharing L2 cache and running at a common frequency.
  • P core clusters can be shut down, left idling at their minimum frequency of 1,260 MHz, or run at one of 18 set frequencies up to a maximum of 4,512 MHz, as controlled by macOS.
  • Their instruction set is “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE).
  • They use 1-2 mW when idle, rising to peaks of 1,400 mW (floating point) or 3,230 mW (NEON vector code).
  • macOS preferentially allocates them threads at all QoS higher than 9 (Background).
  • Threads running on M4 P cores are mobile, and may be moved to another core in the same cluster frequently, and after just over a second may be transferred to a core in the other P cluster, when available.
  • Thread mobility makes interpretation of the CPU History window in Activity Monitor very difficult.

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.
