Reading view

There are new articles available, click to refresh the page.

Tune for Performance: Do you really need a big internal SSD?

12 December 2024 at 15:30

The most common economy that many make when specifying their next Mac is to opt for just 512 GB internal storage, and save the $/€/£600 or so it would cost to increase that to 2 TB. After all, if you can buy a 2 TB Thunderbolt 5 SSD for around two-thirds of the price, why pay more? And who needs the blistering speed of that expensive internal storage when your apps work perfectly well with a far cheaper SSD? This article considers whether that choice matters in terms of performance.

To look at this, I’m going to use the same compression task that I used in this previous article, my app Cormorant relying on the AppleArchive framework in macOS to do all the heavy lifting. As results are going to differ considerably when using other apps and other tasks, I’d like to make it clear that this can’t reach general conclusions that apply to every task and every app: your mileage will vary. My purpose here is to show how you can work out whether using a slower disk will affect your app’s performance.

Methods

In that previous article, I concluded that compression speed was unlikely to be determined by disk performance, as even with all 10 P cores running compression threads, that rate only reached 2.28 GB/s, around the same as write speed to a good Thunderbolt 3 SSD. For today’s tests, I therefore set Cormorant to use the default number of threads (all 14 on my M4 Pro) so it would run as fast as the CPU cores would allow when compressing my standard 15.517 GB IPSW test file.

I’m fortunate to have a range of SSDs to test, and here use the 2 TB internal SSD of my Mac mini M4 Pro, an external USB4 enclosure (OWC Express 1M2) with a 2 TB Samsung 990 Pro SSD, and an external 2 TB Thunderbolt 3 SSD (OWC Envoy Pro SX). As few are likely to have access to such a range, I included two disk images stored on the internal SSD, a 100 GB sparse bundle, and a 50 GB read-write disk image, to see if they could be used to model external storage. All tests used unencrypted APFS file systems.

The first step with each was to measure its write speed using Stibium. Unlike more popular benchmarking apps for the Mac, Stibium measures the speed across a wide range of file sizes, from 2 MB to 2 GB, providing more insight into performance. After those measurements, those test files were removed, the large test file copied to the volume, and compressed by Cormorant at high QoS with the default number of threads.

Disk write speeds

compressionbydisk1

Results in each test followed a familiar pattern, with rapidly increasing write speeds to a peak at a file size of about 200 MB, then a steady rate or slow decline up to the 2 GB tested. The read-write disk image was a bit more erratic, though, with high write speeds at 800 and 1000 MB.

At 2 GB file size, write speeds were:

7.69 GB/s for the internal SSD
7.35 GB/s for the sparse bundle
3.61 GB/s for the USB4 SSD
2.26 GB/s for the Thunderbolt 3 SSD
1.35 GB/s for the disk image.

Those are in accord with my many previous measurements of write speeds for those types of storage.

Compression rates

Times to compress the 15.517 GB test file ranked differently:

5.57 s for the internal SSD
5.84 s for the USB4 SSD
9.67 s for the sparse bundle
10.49 s for the Thunderbolt 3 SSD
16.87 s for the disk image.

When converted to compression rates, sparse bundle results are even more obviously an outlier, as shown in the chart below.

compressionbydisk2

There’s a roughly linear relationship between measured write speed and compression rate in the disk image, Thunderbolt 3 SSD, and USB4 SSD, and little difference between the latter and the internal SSD. These suggest that disk write speed becomes the rate-limiting factor for compression when write speed falls below about 3 GB/s, but above that faster disks make little difference to compression rate.

Poor performance of the sparse bundle was a surprise, given how close its write speeds are to those of the host internal SSD. This is probably the result of compression writing a single very large file across its 14 threads; as the sparse bundle stores file data on a large number of band files, their overhead appears to have got the better of it. I will return to look at this in more detail in the near future, as sparse bundles have become popular largely because of their perceived superior performance.

The difference in compression rates between USB4 and Thunderbolt 3 SSDs is also surprisingly large. Of course, a Mac with fewer cores to run compression threads might not show any significant difference: a base M3 chip with 4 P and 4 E cores is unlikely to achieve a compression rate much in excess of 1.5 GB/s on its internal SSD because of its limited cores, so the restricted write speed of a Thunderbolt 3 SSD may not there become the rate-limiting factor.

Conclusions

The rate-limiting step in task performance will change according to multiple factors, including the effective use of multiple threads on multiple cores, and disk performance.
There’s no simple model you can apply to assess the effects of disk performance, and tests using disk images can be misleading.
You can’t predict whether a task will be disk-bound from disk benchmark performance.
Even expensive high-performance external SSDs can result in noticeably poor task performance. Maybe that money would be better spent on a larger internal SSD after all.

Tune for Performance: do more threads run faster?

The Eclectic Light Company

hoakley

10 December 2024 at 15:30

One of the most distinctive features about modern Macs is that they have multiple cores, in Apple silicon models a minimum of eight. For an app to be able to make use of more than one core (or its equivalent) at a time it needs to divide its processing into threads, discrete blocks of code that can be run on different cores by macOS. If it doesn’t do that, then running that app on a high-end Pro, Max or Ultra chip is unlikely to be significantly faster than on a base model (assuming that task is CPU-bound).

Activity Monitor

You might be able to get a good idea as to how well an app makes use of multiple cores from watching it in use in Activity Monitor’s CPU History window, but in many cases that isn’t conclusive, and needs to be confirmed.

To illustrate ways to tackle this, I take the example of file compression. Some methods lend themselves to multiple threads better than others, and some may be implemented in a way that won’t accelerate when several cores are available. For the occasional user this might make little difference, but if you were to spend much of your day waiting for 10-100 GB files to compress, it merits a little exploration.

polycore1

This CPU History window from a compression task run on an M4 Pro is typically unhelpful. A single file compression is seen in the group of 4-5 peaks in CPU in the right half of each trace. Although compression used all ten P cores, it appears to have been moved between P clusters, seen by comparing the timing of peaks on Cores 8 and 10, 9 and 11. At no time does the task appear to exceed 50% active residency on any of the cores, though. All you can do is guess as to what might be going on, and whether it might run faster on twelve or more P cores.

polycore2

On another run, the same task is reported to have reached a peak of 500% CPU in 12 threads, but would it exceed that on more P cores, given that 10 were available?

Timing performance

As usual, I’ve been cheating a little to generate those results, by using my simple compression-decompression app Cormorant, in which I control how many threads it uses. If the app you’re trying to investigate offers a similar feature, then you can set up a standard task, here the compression of a 15.517 GB IPSW file, and time how long it takes using different numbers of threads.

polycore3

The answers from Cormorant, which conveniently performs its own timing, are:

1 thread takes 49.32 seconds
2 take 26.74 s
3 take 18.60 s
4 take 14.29 s
5 take 11.21 s.

So Cormorant’s compression can make good use of more P cores, although with diminishing returns.

Very few apps give you this level of control, though. The only other compression utility that does appear to is Keka.

polycore4

In its settings, you can give its tasks a maximum number of threads, and even run them at custom Quality of Service (QoS) if you want them to be run in the background on E cores, and not interrupt your work on P cores.

Controlling threads

There is one way that you can limit the effective number of threads used by an arbitrary app, and that’s to run it in a Virtual Machine, as you control the number of virtual cores that it uses. While you can just about run a macOS VM on a single core alone, I suggest that a more workable starting point is two cores, and you can increase that number up to the total of P cores in the host without causing problems.

VMs have other virtues, including their relative lack of background processes, allowing their virtual cores to be almost entirely devoted to your test task. However, as they can’t run apps from the App Store other than Apple’s free suite of Pages, Numbers and Keynote, that could prevent you from using them for testing.

To set up a VM for thread tests, I duplicated a standard Sonoma 14.7.1 VM so I could throw it away at the end, opened it with five virtual cores, and copied over the test app Cormorant and file. I then closed that VM down, set it to use 2 virtual cores, opened it and ran my test. I repeated that with an increasing number of virtual cores up to a total of 5. Compression times are:

2 vCPUs take 30.13 seconds
3 take 21.36 s
4 take 17.18 s
5 take 13.59 s.

Those are only slightly slower than their equivalents from the host tests above.

Analysis

polycore5

Plotting those results out using DataGraph, the lines of best fit follow power laws:

for real cores, time = 49.8863/(T^0.91169)
for virtual cores, time = 54.737/(T^0.854318)

where T is the number of threads. That explains the apparently diminishing returns with increasing numbers of threads, although the maths isn’t as simple as we might like.

A better way to look at this is by calculating the rate of compression in GB/s, simply by dividing the file size of 15.517 GB by each time. Here we end up with straight lines from linear regression, that are more amenable to thought.

polycore6

Those regressions are:

for real cores, rate of compression = 0.0464 + (0.264 x T)
for virtual cores, rate of compression = 0.102 + (0.206 x T)

which are more generally useful.

Tuning performance

This might appear over-elaborate and of little practical use, but we now have a much better understanding of the factors limiting compression performance:

The more threads compression uses, the shorter time a task will take.
There are limits to that improvement, though, when substantially more than 5 threads are used.
Compression rates achieved in M4 P cores are significantly lower than read or write speeds of the internal SSD, so aren’t likely to be limited by the speed of a faster SSD (TB3 or USB4).
As compression appears to be CPU-bound, faster P cores would also be expected to result in shorter times.
Improving the efficiency of the compression code could increase performance.
Compression in a VM runs at about 78% of speed on the host.

To see how reliable these are, I therefore repeated the Cormorant test using all 10 P cores on the host Mac, which took 6.80 seconds, a little more than half the time for 5 cores. That’s a compression rate of 2.28 GB/s, rather less than the 2.69 GB/s predicted by the linear regression. That’s now approaching the write speed of some TB3 SSDs, and the rate-limiting step could then change to be the write performance of the storage being used, rather than CPU cores.