Reading view

There are new articles available, click to refresh the page.

Are sparse bundles faster than disk images?

Following tests over two years ago, I recommended the use of sparse bundles (UDSB) rather than plain UDIF read-write (UDRW) disk images, because of their faster and more consistent performance. Since then, read-write disk images have adopted a sparse file format, making them more space-efficient. However, their write performance has remained substantially lower than sparse bundles, and the latter remain the preferred format.

More recent performance tests have confirmed this in Sequoia, as shown in the table below.

xferdiskimage24

Most recently, tests of file compression showed relatively poor performance within a sparse bundle, with overall compression speed only 57% that of the SSD hosting the bundle despite apparently good write speeds. This article explores why, and whether that should alter choice of disk images.

Consistency in write speed

As previous articles note, inconsistent performance troubles disk images. Because of their complex bundle structure and storage across thousands of band files, it’s important to consider whether this poor performance is merely a reflection of that inconsistency.

sparsebundlewritesp0

The original Stibium write speed measurements, separated out from the collation shown in that recent article, show more dispersion about the regression line than other tests. As those are based on a limited number of file sizes, a total of 130 write speed measurements were performed using file sizes ranging from 4.4 MB to 3.0 GB, 50 of them being randomly chosen.

sparsebundlewritesp1

Those confirm that some individual write speeds are unexpectedly low, and dispersion isn’t as low as hoped for, but don’t make inconsistency a plausible explanation.

Band size

Unlike other types of disk image, data saved in sparse bundles is contained in many smaller files termed bands, and the maximum size of those band files can be set when a sparse bundle is created. By default, band size is set at 8.4 MB and there are only two ways to use a custom size, either when creating the sparse bundle using the command hdiutil create or in my free utility Spundle. Although those allow you to set a preferred band size, the size actually used may differ slightly from that.

Band size could have significant effects on the performance and stability of sparse bundles, as it determines how many band files are used to store data. Using a size that results in more than about 100,000 band files has in the past (with HFS+ at least) made sparse bundles prone to failure.

To assess the effect of different band sizes on compression rate, I therefore repeated exactly the same test file compression in sparse bundles with 12 band sizes ranging from 2.0 to 1020 MB, as measured rather than the value set.

sparsebundlebandstime

In the graph above, time to complete the compression task is shown for each of the band sizes tested, and shows a minimum time (fastest compression) at a band size of 15.4 MB, of about 8.3 seconds, significantly lower than the 11.08 seconds at the default of 8.4 MB. Band sizes between 2.0-8.4 MB showed little differences, but increasing band size to 33.8 MB and greater resulted in much slower compression.

sparsebundlebandsrate

When expressed as compression rates, peak performance at a band size of 15.4 MB is even clearer, and approaches 1.9 GB/s, more than twice that of a read-write disk image, but still far below rates in excess of 2.6 GB/s found for the USB4 and internal SSDs. With band sizes above 33.8 MB, compression rates fall below those of a read-write disk image.

Another way of looking at optimum band size is to convert that into the number of band files required. Fastest rates were observed here in sparse bundles that would have about 6,000-7,000 bands in total when the sparse bundle is full, and in this case with 2,000-2,300 band files in use. Performance fell when the maximum number of band files fell below 3,000, or those in use fell below 1,000.

Streaming

Compression tests impose different demands on storage from conventional read or write benchmarks because file data to be compressed is streamed by reading data from storage, and writing compressed data out to the same storage device. Although source and destination are separate files, allowing simultaneous reading and writing, if those files are stored in common band files, it’s likely that access has to be limited and can’t be simultaneous.

As observed here, that’s likely to result in performance substantially below that expected from single-mode transfer tests. Unfortunately, it’s also extremely difficult to measure attempts to simultaneously read and write from different files in the same storage medium.

Conclusions

  • Sparse bundles (UDSB) remain generally faster in use than read-write (UDRW) disk images.
  • Sparse bundle performance is sensitive to band size.
  • Default 8.4 MB bands aren’t fastest in all circumstances.
  • When sparse bundle performance is critical, it may be worth optimising band file size instead of using the default.
  • Tasks that stream data from and back to the same storage are likely to run more slowly in a sparse bundle than on its host storage.

Previous articles

Dismal write performance of Disk Images
Which disk image format?
Disk Images: Performance

Last Week on My Mac: Compression models

Over the fifty years since I started using computers, one of the greatest changes has been who does the waiting. In my early days, it was me, in some cases waiting overnight for batches to be processed on a mainframe computer. By the mid 1990s I had to wait a couple of days for some of my optimisation runs to complete on the Mac in my office. Since then, as Macs have got progressively faster, they have come to do the waiting for me. Whether I’m editing an image or writing an article, my Mac keeps pace with almost everything I do.

One notable if infrequent exception is when there’s a macOS update. On a newer Apple silicon model, installation itself is far quicker than when Apple changed to using the SSV in Big Sur, but there’s always that wait for the downloaded update to be “prepared” in the background. Much of that is thought to be decompression, which minimises the size of downloads by putting the burden on the Mac to expand and organise what needs to be installed.

Data compression is widely used in modern operating systems. Take a look in Activity Monitor’s Memory view, and at the bottom right you’ll see how much compressed memory is in use, here currently over 1 GB. It’s used to save space in macOS, where many system files are stored in compressed formats. Compression is also an integral part of many media formats, whether still or moving images, or audio. Unless they’re compressed, we’d only be able to enjoy a small fraction of the media we now consume so voraciously.

It’s understandable that compression is thus one of the best-studied tasks in computing. Over decades of intensive research and practical experience, we have learned that the most powerful methods of compressing data can also take the most processing time. The effectiveness of compression is generally expressed in terms of compression ratio, the ratio of original file size to that of the compressed file. Thus all effective compression methods should result in a compression ratio of more than 1. You’ll also come across percentage space saving, the reduction in size compared to the original. If a compression method shrinks a 10 GB file to 5 GB, then its compression ratio is 10/5 = 2, and its space saving is (1 – 5/10) x 100 = 50%.

One of the most efficient compression techniques has been that in nanozip 0.09a, for instance, which can achieve a compression ratio of 5.3, or a space saving of 81%, on English text taken from Wikipedia. This illustrates the trade-offs made in such efficient techniques, as to achieve that it takes almost 128 seconds to compress 100 MB, a rate of 0.783 MB/s. That makes it of little everyday use and unsuitable for macOS updates, where it also couldn’t achieve the same high compression ratio because the data are binary rather than English text.

Optimising compression isn’t as simple as it might appear. There’s an inverse relationship between the time spent compressing data and the amount of compressed data to be written, as there is with many similar tasks. The more sophisticated the compression and the higher the compression ratio, the less data there is to be written out, although there’s still the same amount of data to be read. Using the example of a compression ratio of 2, 10 GB of input data has to be read and only 5 GB written out, so unless write speed is less than half read speed, reading the input data is likely to be more rate-limiting.

Compression is also representative of many other types of everyday tasks, as it consists of three sub-tasks:

  1. read data from disk
  2. process it (encode, decode, transcode)
  3. write the transformed data to disk

Unless you use small file sizes, this is usually performed on data streamed from storage and back to storage, rather than on single chunks of data in memory.

To understand what limits performance of any of those everyday tasks, we need to understand how each of its sub-tasks is limited, and how those limitations interact. In some cases, such as preparation of macOS updates, time taken appears to be determined by the processing required in the second step. In the past, Apple has run all that in a single thread to minimise impact on concurrent use of other apps in the foreground. In other tasks, such as encrypting files, stage 2 processing should be fastest, making it most likely that disk access is limiting.

Although we can’t apply conclusions drawn from studying any specific compression method, we can use it as a model to develop tools that we can then apply to other tasks. Over the last week, I’ve published two articles here discussing how to do that.

In the first I asked how we can investigate whether more threads will run a task faster. We know that for my compression task they do, but that may not be true for the tasks you’re most interested in accelerating. In rare cases, an app may provide a control to set how many threads to use, and it’s easy to use that to investigate. It’s more likely that your app has no such control, but one way around that is to run it in a virtual machine, where you can pick the number of virtual cores to host that VM.

Timing the performance of your app with test files of known size should give a performance rate, and that can help identify which of those sub-tasks is probably limiting overall performance. If your app transcodes video, for example, and you time how long it takes to complete all three sub-tasks on a 10 GB video clip, you can work out its transcoding rate in GB/s. If that test clip takes 20 seconds to read that test file, transcode it, and write out the converted video, then its overall transcoding rate is 0.5 GB/s.

The next article considers how you might work out whether running that task on high-speed storage would improve that overall performance. Although it looks unlikely that a transcoding rate of 0.5 MB/s would be altered by running it on a Thunderbolt 3 disk with its 2.3 GB/s write speed, it’s worth comparing performance on the faster internal SSD to check that, and decide whether a faster USB4 SSD might be better.

By the time that you’ve worked through those tests, you should have better insight into which of those three sub-tasks is limiting performance, how you might be able to improve them, and compare current performance in the overall transcoding rate against attempts at improvement. As I wrote last week, the only way to do that is with objective measurements such as that transcoding rate.

Next week I’ll continue this series by considering how you can control the type of CPU core threads run on, and some related issues about sparse bundles and the performance of Time Machine backup storage.

Tune for Performance: Do you really need a big internal SSD?

The most common economy that many make when specifying their next Mac is to opt for just 512 GB internal storage, and save the $/€/£600 or so it would cost to increase that to 2 TB. After all, if you can buy a 2 TB Thunderbolt 5 SSD for around two-thirds of the price, why pay more? And who needs the blistering speed of that expensive internal storage when your apps work perfectly well with a far cheaper SSD? This article considers whether that choice matters in terms of performance.

To look at this, I’m going to use the same compression task that I used in this previous article, my app Cormorant relying on the AppleArchive framework in macOS to do all the heavy lifting. As results are going to differ considerably when using other apps and other tasks, I’d like to make it clear that this can’t reach general conclusions that apply to every task and every app: your mileage will vary. My purpose here is to show how you can work out whether using a slower disk will affect your app’s performance.

Methods

In that previous article, I concluded that compression speed was unlikely to be determined by disk performance, as even with all 10 P cores running compression threads, that rate only reached 2.28 GB/s, around the same as write speed to a good Thunderbolt 3 SSD. For today’s tests, I therefore set Cormorant to use the default number of threads (all 14 on my M4 Pro) so it would run as fast as the CPU cores would allow when compressing my standard 15.517 GB IPSW test file.

I’m fortunate to have a range of SSDs to test, and here use the 2 TB internal SSD of my Mac mini M4 Pro, an external USB4 enclosure (OWC Express 1M2) with a 2 TB Samsung 990 Pro SSD, and an external 2 TB Thunderbolt 3 SSD (OWC Envoy Pro SX). As few are likely to have access to such a range, I included two disk images stored on the internal SSD, a 100 GB sparse bundle, and a 50 GB read-write disk image, to see if they could be used to model external storage. All tests used unencrypted APFS file systems.

The first step with each was to measure its write speed using Stibium. Unlike more popular benchmarking apps for the Mac, Stibium measures the speed across a wide range of file sizes, from 2 MB to 2 GB, providing more insight into performance. After those measurements, those test files were removed, the large test file copied to the volume, and compressed by Cormorant at high QoS with the default number of threads.

Disk write speeds

compressionbydisk1

Results in each test followed a familiar pattern, with rapidly increasing write speeds to a peak at a file size of about 200 MB, then a steady rate or slow decline up to the 2 GB tested. The read-write disk image was a bit more erratic, though, with high write speeds at 800 and 1000 MB.

At 2 GB file size, write speeds were:

  • 7.69 GB/s for the internal SSD
  • 7.35 GB/s for the sparse bundle
  • 3.61 GB/s for the USB4 SSD
  • 2.26 GB/s for the Thunderbolt 3 SSD
  • 1.35 GB/s for the disk image.

Those are in accord with my many previous measurements of write speeds for those types of storage.

Compression rates

Times to compress the 15.517 GB test file ranked differently:

  • 5.57 s for the internal SSD
  • 5.84 s for the USB4 SSD
  • 9.67 s for the sparse bundle
  • 10.49 s for the Thunderbolt 3 SSD
  • 16.87 s for the disk image.

When converted to compression rates, sparse bundle results are even more obviously an outlier, as shown in the chart below.

compressionbydisk2

There’s a roughly linear relationship between measured write speed and compression rate in the disk image, Thunderbolt 3 SSD, and USB4 SSD, and little difference between the latter and the internal SSD. These suggest that disk write speed becomes the rate-limiting factor for compression when write speed falls below about 3 GB/s, but above that faster disks make little difference to compression rate.

Poor performance of the sparse bundle was a surprise, given how close its write speeds are to those of the host internal SSD. This is probably the result of compression writing a single very large file across its 14 threads; as the sparse bundle stores file data on a large number of band files, their overhead appears to have got the better of it. I will return to look at this in more detail in the near future, as sparse bundles have become popular largely because of their perceived superior performance.

The difference in compression rates between USB4 and Thunderbolt 3 SSDs is also surprisingly large. Of course, a Mac with fewer cores to run compression threads might not show any significant difference: a base M3 chip with 4 P and 4 E cores is unlikely to achieve a compression rate much in excess of 1.5 GB/s on its internal SSD because of its limited cores, so the restricted write speed of a Thunderbolt 3 SSD may not there become the rate-limiting factor.

Conclusions

  • The rate-limiting step in task performance will change according to multiple factors, including the effective use of multiple threads on multiple cores, and disk performance.
  • There’s no simple model you can apply to assess the effects of disk performance, and tests using disk images can be misleading.
  • You can’t predict whether a task will be disk-bound from disk benchmark performance.
  • Even expensive high-performance external SSDs can result in noticeably poor task performance. Maybe that money would be better spent on a larger internal SSD after all.

Tune for Performance: do more threads run faster?

One of the most distinctive features about modern Macs is that they have multiple cores, in Apple silicon models a minimum of eight. For an app to be able to make use of more than one core (or its equivalent) at a time it needs to divide its processing into threads, discrete blocks of code that can be run on different cores by macOS. If it doesn’t do that, then running that app on a high-end Pro, Max or Ultra chip is unlikely to be significantly faster than on a base model (assuming that task is CPU-bound).

Activity Monitor

You might be able to get a good idea as to how well an app makes use of multiple cores from watching it in use in Activity Monitor’s CPU History window, but in many cases that isn’t conclusive, and needs to be confirmed.

To illustrate ways to tackle this, I take the example of file compression. Some methods lend themselves to multiple threads better than others, and some may be implemented in a way that won’t accelerate when several cores are available. For the occasional user this might make little difference, but if you were to spend much of your day waiting for 10-100 GB files to compress, it merits a little exploration.

polycore1

This CPU History window from a compression task run on an M4 Pro is typically unhelpful. A single file compression is seen in the group of 4-5 peaks in CPU in the right half of each trace. Although compression used all ten P cores, it appears to have been moved between P clusters, seen by comparing the timing of peaks on Cores 8 and 10, 9 and 11. At no time does the task appear to exceed 50% active residency on any of the cores, though. All you can do is guess as to what might be going on, and whether it might run faster on twelve or more P cores.

polycore2

On another run, the same task is reported to have reached a peak of 500% CPU in 12 threads, but would it exceed that on more P cores, given that 10 were available?

Timing performance

As usual, I’ve been cheating a little to generate those results, by using my simple compression-decompression app Cormorant, in which I control how many threads it uses. If the app you’re trying to investigate offers a similar feature, then you can set up a standard task, here the compression of a 15.517 GB IPSW file, and time how long it takes using different numbers of threads.

polycore3

The answers from Cormorant, which conveniently performs its own timing, are:

  • 1 thread takes 49.32 seconds
  • 2 take 26.74 s
  • 3 take 18.60 s
  • 4 take 14.29 s
  • 5 take 11.21 s.

So Cormorant’s compression can make good use of more P cores, although with diminishing returns.

Very few apps give you this level of control, though. The only other compression utility that does appear to is Keka.

polycore4

In its settings, you can give its tasks a maximum number of threads, and even run them at custom Quality of Service (QoS) if you want them to be run in the background on E cores, and not interrupt your work on P cores.

Controlling threads

There is one way that you can limit the effective number of threads used by an arbitrary app, and that’s to run it in a Virtual Machine, as you control the number of virtual cores that it uses. While you can just about run a macOS VM on a single core alone, I suggest that a more workable starting point is two cores, and you can increase that number up to the total of P cores in the host without causing problems.

VMs have other virtues, including their relative lack of background processes, allowing their virtual cores to be almost entirely devoted to your test task. However, as they can’t run apps from the App Store other than Apple’s free suite of Pages, Numbers and Keynote, that could prevent you from using them for testing.

To set up a VM for thread tests, I duplicated a standard Sonoma 14.7.1 VM so I could throw it away at the end, opened it with five virtual cores, and copied over the test app Cormorant and file. I then closed that VM down, set it to use 2 virtual cores, opened it and ran my test. I repeated that with an increasing number of virtual cores up to a total of 5. Compression times are:

  • 2 vCPUs take 30.13 seconds
  • 3 take 21.36 s
  • 4 take 17.18 s
  • 5 take 13.59 s.

Those are only slightly slower than their equivalents from the host tests above.

Analysis

polycore5

Plotting those results out using DataGraph, the lines of best fit follow power laws:

  • for real cores, time = 49.8863/(T^0.91169)
  • for virtual cores, time = 54.737/(T^0.854318)

where T is the number of threads. That explains the apparently diminishing returns with increasing numbers of threads, although the maths isn’t as simple as we might like.

A better way to look at this is by calculating the rate of compression in GB/s, simply by dividing the file size of 15.517 GB by each time. Here we end up with straight lines from linear regression, that are more amenable to thought.

polycore6

Those regressions are:

  • for real cores, rate of compression = 0.0464 + (0.264 x T)
  • for virtual cores, rate of compression = 0.102 + (0.206 x T)

which are more generally useful.

Tuning performance

This might appear over-elaborate and of little practical use, but we now have a much better understanding of the factors limiting compression performance:

  • The more threads compression uses, the shorter time a task will take.
  • There are limits to that improvement, though, when substantially more than 5 threads are used.
  • Compression rates achieved in M4 P cores are significantly lower than read or write speeds of the internal SSD, so aren’t likely to be limited by the speed of a faster SSD (TB3 or USB4).
  • As compression appears to be CPU-bound, faster P cores would also be expected to result in shorter times.
  • Improving the efficiency of the compression code could increase performance.
  • Compression in a VM runs at about 78% of speed on the host.

To see how reliable these are, I therefore repeated the Cormorant test using all 10 P cores on the host Mac, which took 6.80 seconds, a little more than half the time for 5 cores. That’s a compression rate of 2.28 GB/s, rather less than the 2.69 GB/s predicted by the linear regression. That’s now approaching the write speed of some TB3 SSDs, and the rate-limiting step could then change to be the write performance of the storage being used, rather than CPU cores.

❌