Encryption and checking hashes slows faster SSDs

It’s commonly claimed that software encryption, as used in APFS Encrypted format, incurs negligible overhead. The last time I looked at this was with Thunderbolt 3 SSDs connected to a Mac Studio M1 Max, when I found that the overhead varied according to the SSD. One of the three I tested then showed a significant reduction in encrypted write speed, from 2.2 to 1.8 GB/s, but the fastest showed no change from its unencrypted write speed of 2.8 GB/s. This article reports new test results from a Mac mini M4 Pro with faster SSDs, one Thunderbolt 5 and the other USB4, and adds data for computing SHA256 hashes.

These results are of particular interest, as not only are the unencrypted transfer speeds of both SSDs significantly higher than those of the Thunderbolt 3 SSDs, but the host also has significantly faster CPU cores.

Two sets of measurements were made on each of the two SSDs:

  • Stibium 1.2 running on macOS 15.5 Sequoia was used to measure read and write speeds over randomised sequences totalling 53 GB in 160 files ranging from 2 MB to 2 GB in size.
  • Stibium was used to measure the single-file read speed of a 16.8 GB IPSW file, and Dintch was used to measure the time taken to stream that file in and compute its SHA256 digest using CryptoKit (a sketch of this streamed approach follows this list).
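
Dintch’s own code isn’t reproduced here, but a minimal sketch of that kind of streamed digest computation, using CryptoKit’s SHA256 with an arbitrary 1 MB chunk size and a hypothetical file path, might look like this:

    import Foundation
    import CryptoKit

    // Minimal sketch: stream a large file through SHA256 in 1 MB chunks,
    // rather than reading it into memory in one go. The path and chunk
    // size are arbitrary choices for illustration only.
    func sha256Digest(of url: URL) throws -> SHA256.Digest {
        let handle = try FileHandle(forReadingFrom: url)
        defer { try? handle.close() }
        var hasher = SHA256()
        while let chunk = try handle.read(upToCount: 1_048_576), !chunk.isEmpty {
            hasher.update(data: chunk)
        }
        return hasher.finalize()
    }

    do {
        let url = URL(fileURLWithPath: "/path/to/test.ipsw")   // hypothetical path
        let digest = try sha256Digest(of: url)
        print(digest.map { String(format: "%02x", $0) }.joined())
    } catch {
        print("Hashing failed: \(error)")
    }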

Read and write speeds

Results of the first series of tests showed both SSDs performed as expected when using plain APFS, with read and write speeds of 5.3 GB/s for TB5, and 3.7 GB/s for USB4.

Small reductions in read speed were seen in both SSDs when using APFS Encrypted, to about 98% and 95% of their unencrypted read speed. Although there was a similar small reduction in write speed for USB4, to 97%, that seen in the Thunderbolt 5 SSD was greater, with a fall from 5.3 to 4.7 GB/s (89%). Both sets of tests were repeated for that SSD, allowing ample time for the SLC cache to be emptied after each set of write tests, and results remained essentially the same.

Although write speed to APFS Encrypted for this Thunderbolt 5 SSD remained well above that for USB4, encryption brought a reduction in speed of just over 10%, more than I had anticipated.

Hash computation

SHA256 and SHA512 digests are now used to check file data integrity. Both are computationally intensive, and I have previously reported that reading files of substantial size and computing their digests using CryptoKit proceeds at about 3 GB/s for files stored on the fast internal SSD of a Mac mini M4 Pro.

With the Thunderbolt 5 SSD, the 16.8 GB test file was read at 6.5 GB/s from plain APFS, and at 4.7 GB/s from APFS Encrypted. SHA256 digest computation proceeded at 2.6 GB/s from plain APFS, and 2.2 GB/s from APFS Encrypted, both well below the speed achieved from the internal SSD, and less than half the speed of simply reading the file.

Although the USB4 SSD was inevitably slower on the read tests, at 3.8 GB/s, encryption had little effect, at 3.7 GB/s. SHA256 digest computation was, if anything, faster than with Thunderbolt 5, at 2.8 GB/s plain, and 2.7 GB/s encrypted.

Conclusions

Although there may well be differences with other Thunderbolt 5 and USB4 SSDs, and more extensive results would be helpful, these tests suggest that:

  • Whether using plain or encrypted APFS, Thunderbolt 5 SSDs are substantially faster than USB4 SSDs.
  • Encryption can result in significantly lower write speeds on some Thunderbolt 5 SSDs.
  • Otherwise, encryption has only small effects on read and write speeds.
  • Computation of SHA256 digests is significantly slower than encryption, and ranges from 2.2 to 2.8 GB/s on larger files.
  • This suggests that, even in faster M4 chips, CPU performance limits the speed of software encryption, and even more so for SHA256 digest computation.

Checking data integrity

How can you check whether important data has acquired errors? It could be in packets transferred over a network, files being stored in the cloud, or those put in an archive for safe-keeping. This article explains how errors can be detected and in some cases corrected.

Checksum

One quick and simple way to keep track of whether errors have crept in is to add all the byte values together using modulo arithmetic, and see whether that number matches what we expect. Modulo addition ensures the sum never exceeds 255 (for 8-bit bytes), but it also means there’s roughly a 1 in 256 chance that the checksum remains correct even when the data is wrong. There’s an even more serious problem, though: changing the order of bytes in the data has no effect on their checksum, even though it scrambles the data.
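
As an illustration, here’s a minimal sketch in Swift of such a byte-sum checksum, with arbitrary sample data, showing that reordering the bytes goes undetected:

    // Minimal sketch of a simple modulo-256 checksum: sum all the bytes,
    // wrapping on overflow so the result always fits in 8 bits.
    func simpleChecksum(_ bytes: [UInt8]) -> UInt8 {
        bytes.reduce(0) { $0 &+ $1 }   // &+ is wrapping (modulo 256) addition
    }

    let original: [UInt8] = [0x01, 0x02, 0x03, 0x04]
    let scrambled: [UInt8] = [0x04, 0x03, 0x02, 0x01]  // same bytes, different order
    print(simpleChecksum(original) == simpleChecksum(scrambled))  // true: reordering goes undetected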

One solution to both of those weaknesses is the Fletcher checksum, which uses two running sums instead of one. Both start at zero; the value of each block of data is added to the first sum, and after each addition the new value of the first sum is added to the second. At the end the two sums are combined to give the Fletcher checksum.

As this only uses modulo addition, it’s extremely quick, so it’s used with 32-bit blocks for the Fletcher-64 checksum in APFS file system metadata. As each of its two 32-bit sums can take almost 4.3 billion values, the chance of a ‘collision’, in which an error fails to show up in the checksum, is vanishingly small. Wikipedia’s account includes worked examples.
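
APFS’s Fletcher-64 works over 32-bit blocks, but the same idea is easier to see at a smaller size; this is a sketch of a Fletcher-16-style checksum over bytes, showing that the second running sum now catches reordering:

    // Sketch of a Fletcher-16-style checksum: two running sums modulo 255.
    // The second sum accumulates the first, making the result order-sensitive.
    func fletcher16(_ bytes: [UInt8]) -> UInt16 {
        var sum1: UInt16 = 0
        var sum2: UInt16 = 0
        for byte in bytes {
            sum1 = (sum1 + UInt16(byte)) % 255
            sum2 = (sum2 + sum1) % 255
        }
        return (sum2 << 8) | sum1
    }

    let original: [UInt8] = [0x01, 0x02, 0x03, 0x04]
    let scrambled: [UInt8] = [0x04, 0x03, 0x02, 0x01]
    print(fletcher16(original) == fletcher16(scrambled))  // false: reordering now changes the checksum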

Cyclic redundancy check

These use additional bits to form cyclic codes based on the remainder of a polynomial division of the data contents. Although they sound complex, they use simple binary operations, so can be very quick in use. Again, Wikipedia’s account includes worked examples.
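
To show how simple those binary operations are, here’s a bit-by-bit sketch in Swift of the widely used CRC-32 polynomial (in its reflected form); production implementations normally add lookup tables for speed:

    // Sketch of a bitwise CRC-32 (the common IEEE polynomial, reflected form).
    // Each byte is folded into the register, then shifted out bit by bit,
    // XORing in the polynomial whenever the low bit is set.
    func crc32(_ bytes: [UInt8]) -> UInt32 {
        var crc: UInt32 = 0xFFFFFFFF
        for byte in bytes {
            crc ^= UInt32(byte)
            for _ in 0..<8 {
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320 : crc >> 1
            }
        }
        return ~crc
    }

    let data = Array("123456789".utf8)
    print(String(format: "%08x", crc32(data)))  // cbf43926, the standard check value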

These were originally developed for use over communication channels such as radio, where they are effective against short bursts of errors. Since then they have been widely used in hard disks and optical storage. They have two main disadvantages: they are easily reversible, so data can be deliberately crafted to match a given CRC, and they therefore can’t protect against intentional modification.

Hash

Secure Hash Algorithms, SHA, are a family of cryptographic functions based on a one-way compression function. The first, SHA-1, produces a 160-bit value referred to as the digest. Used from 1995, it was broken in 2005, and has been replaced by SHA-2 using digest sizes of 224-512 bits, now most widely used as SHA-256.

For SHA-256, data is processed in 512-bit chunks and subjected to 64 rounds of a compression function before being combined into the 256-bit digest, while SHA-512 uses 1024-bit chunks and 80 rounds. Full details are again given in Wikipedia.
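
In practice those rounds are hidden behind CryptoKit’s one-shot calls; this short sketch, with an arbitrary input string, simply confirms the 32-byte and 64-byte digest sizes of SHA-256 and SHA-512:

    import Foundation
    import CryptoKit

    // One-shot hashing of the same arbitrary message with both algorithms.
    let message = Data("The quick brown fox jumps over the lazy dog".utf8)
    let digest256 = SHA256.hash(data: message)
    let digest512 = SHA512.hash(data: message)
    print(SHA256.Digest.byteCount, SHA512.Digest.byteCount)    // 32 and 64 bytes
    print(digest256.map { String(format: "%02x", $0) }.joined())
    print(digest512.map { String(format: "%02x", $0) }.joined())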

Important properties of cryptographic hash functions include:

  • The mapping from input data to hash is deterministic, so the same data always generates the same hash.
  • It’s not feasible to work out the input data for any given hash, making the mapping one-way.
  • Collisions are so rare as to not occur in practice.
  • Small changes in the input data result in large changes in the hash, so amplifying any differences (demonstrated in the sketch after this list).
  • Hash values should be fairly evenly distributed.
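
That amplification of differences is easy to demonstrate with CryptoKit; this sketch hashes two arbitrary strings differing in a single character and counts how many of the 256 digest bits change, which for a good hash averages around half:

    import Foundation
    import CryptoKit

    // Two inputs differing by one character produce apparently unrelated digests.
    let a = SHA256.hash(data: Data("integrity check 1".utf8))
    let b = SHA256.hash(data: Data("integrity check 2".utf8))
    print(a.map { String(format: "%02x", $0) }.joined())
    print(b.map { String(format: "%02x", $0) }.joined())

    // Count how many of the 256 bits differ between the two digests.
    let differingBits = zip(a, b).map { ($0 ^ $1).nonzeroBitCount }.reduce(0, +)
    print(differingBits)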

SHA-256 and related hashes are used in code signatures, as CDHashes of the protected parts of each app, bundle, etc. They’re also used to verify the integrity of the Signed System Volume in modern macOS, where they’re assembled into a tree hierarchy so they can be verified lazily, on-demand. More generally, cryptographic hashes are used in message authentication codes (MAC) to verify data integrity in TLS (formerly SSL).

Error-correcting code

Those methods of detecting errors can’t, in general, be used to correct them. The first effective error-correcting code was devised by Richard Hamming, and is named after him. It can correct all single-bit errors, and will detect two-bit errors as well. Wikipedia gives an excellent account, complete with explanatory Venn diagrams.
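
As a small illustration of the principle, here’s a sketch of a Hamming(7,4) encoder and corrector in Swift, using an arbitrary 4-bit example and a deliberately flipped bit; real implementations pack the bits rather than using arrays of integers:

    // Sketch of Hamming(7,4): 4 data bits are protected by 3 parity bits,
    // letting any single-bit error be located and corrected.
    // Codeword bit positions are 1...7: p1 p2 d1 p3 d2 d3 d4.
    func hammingEncode(_ d: [Int]) -> [Int] {          // d = [d1, d2, d3, d4], each 0 or 1
        let p1 = d[0] ^ d[1] ^ d[3]                    // covers positions 3, 5, 7
        let p2 = d[0] ^ d[2] ^ d[3]                    // covers positions 3, 6, 7
        let p3 = d[1] ^ d[2] ^ d[3]                    // covers positions 5, 6, 7
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]
    }

    func hammingCorrect(_ c: [Int]) -> [Int] {
        var code = c
        // Recompute each parity over the positions it covers; the syndrome
        // is the 1-based position of a single-bit error, or 0 if none.
        let s1 = code[0] ^ code[2] ^ code[4] ^ code[6]  // positions 1, 3, 5, 7
        let s2 = code[1] ^ code[2] ^ code[5] ^ code[6]  // positions 2, 3, 6, 7
        let s3 = code[3] ^ code[4] ^ code[5] ^ code[6]  // positions 4, 5, 6, 7
        let syndrome = s1 + 2 * s2 + 4 * s3
        if syndrome != 0 { code[syndrome - 1] ^= 1 }    // flip the erroneous bit back
        return [code[2], code[4], code[5], code[6]]     // recover d1...d4
    }

    let data = [1, 0, 1, 1]
    var codeword = hammingEncode(data)
    codeword[5] ^= 1                                    // simulate a single-bit error
    print(hammingCorrect(codeword) == data)             // true: the error is corrected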

Ten years after Hamming code came Reed-Solomon code (R-S), invented by Irving S Reed and Gustave Solomon. Nearly twenty years later, when Philips Labs were developing the format for CDs, R-S codes were adopted to correct errors when reading them. Unlike other codes, when used in CDs, R-S codes are applied to bytes rather than bits, in two steps.

The first encodes 24 B of input data into 28 B of code. Interleaving is then performed on those codewords in blocks of 28, or 784 B of codewords, following which a second R-S coding is performed to convert each 28 B into 32 B of code. The overall code rate is thus 24/32, so an input file grows by a third following this double encoding. R-S code is explained in detail in this Wikipedia article.

The reason for such a complex scheme of error-correction in CDs is to correct bursts of errors up to about 4,000 bits, or 500 bytes, representing about 2.5 mm of the track on a CD. Similar methods have been used for DVDs and in parchive files accompanying USENET posts. However, it becomes progressively harder and less efficient to provide error-correction for larger blocks of missing data, which is of course one of the most serious problems in computer storage systems.

macOS

Unlike some file systems including ZFS, APFS and macOS don’t provide any native method for checking the integrity of data stored in files, although macOS does offer SHA-256 and SHA-512 support in CryptoKit. My free suite of two apps (Dintch and Fintch) and a command tool (cintch) offers a robust scheme using CryptoKit’s SHA-256 that I and others have been using for the last five years. Details are on their Product Page.
