Last Week on My Mac: A strategy for data integrity
File data integrity is one of those topics that never goes away. Maybe that’s because we’ve all suffered in the past, and can’t face a repeat of that awful feeling when an important document can’t be opened because it’s damaged, or crucial data have gone missing. Before considering what we could do to prevent that from happening, we must be clear about how it could occur.
We have an important file, and immediately after it was last changed and saved, a SHA256 digest was made of it and saved to that file as an extended attribute, in the way that you can using Dintch, Fintch or cintch. A few days or weeks later we open the file and discover its contents have changed.
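As a minimal sketch of how such a digest is computed, here is one way to do it in Python; the extended attribute name shown in the comment is purely illustrative, not the one Dintch actually uses:

```python
import hashlib

def file_digest(path: str) -> str:
    """Return the SHA-256 digest of a file's contents as a hex string."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# On macOS the digest could then be attached as an extended attribute,
# e.g. xattr -w com.example.sha256 <digest> <file>
# (the attribute name here is hypothetical).
```

Comparing a freshly computed digest against the stored one is then a simple string comparison.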
Reasons
What could account for that?
One obvious reason is that the file was intentionally changed and saved without updating its digest. Provided there are good backups, we should be able to step back through them to identify when that change occurred, and decide whether it’s plausible that it was performed intentionally. Although the file’s Modified datestamp should coincide with the change seen in its backups, there’s no way of confirming that change was intentional, or even which app was used to write the changed file (with some exceptions, such as PDF).
Exactly the same observations would also be consistent with the file being unintentionally changed, perhaps as a result of a bug in another app or process that caused it to write to the wrong file or storage location. The changed digest can only detect the change in file content, and can’t indicate what was responsible. This is a problem common to file systems that automatically update their own records of file digests: they are unable to tell whether a change was intentional, only that there has been one. The same applies to changes resulting from malicious activity.
The one circumstance in which a change in contents, and hence in digest, wouldn’t necessarily be accompanied by a change in the file’s Modified datestamp is when an error occurs in the storage medium. However, this is also the least likely to be encountered in modern storage media without that error being reported.
Errors occurring during transfer to and from storage are detected by CRC or similar checks made as part of the transfer protocol. This is one of the reasons why a transfer bandwidth of 40 Gb/s cannot realise a data transfer rate of 5 GB/s, because part of that bandwidth is used by the error-checking overhead. Once written to a hard disk or SSD, error-correcting codes are used to verify integrity of the data, and are used to detect bad storage blocks.
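As an illustration of how a cyclic redundancy check catches corruption in transit, here is a sketch using Python’s zlib CRC-32; real transfer protocols use their own codes, so this only demonstrates the principle:

```python
import zlib

payload = b"some block of data in transit"
crc = zlib.crc32(payload)  # sender computes and sends the CRC with the data

# Simulate a single flipped bit during transfer.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

# The receiver recomputes the CRC and compares it with the one sent.
assert zlib.crc32(payload) == crc      # intact data passes
assert zlib.crc32(corrupted) != crc    # the corruption is detected
```

When the check fails, the protocol simply retransmits the affected block, which is part of why usable throughput falls short of raw bandwidth.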
Out of interest, I’ve been conducting a long-term experiment with 97 image files totalling 60.8 MB stored in my iCloud Drive since 11 April 2020, over five years ago. At least once a year I download them all and check them using Dintch, and so far I haven’t had a single error.
Datestamps
There are dangers inherent in putting trust in file datestamps as markers of change.
In APFS, each file has four different datestamps stored in its attributes:
- create_time, the time of creation of that file
- mod_time, the time the file was last modified
- change_time, the time the file’s attributes, including extended attributes, were last modified
- access_time, the time the file was last read.
For example, a file with the following datestamps
create_time 2025-04-18 19:58:48.707+0100
mod_time 2025-04-18 20:00:56.134+0100
change_time 2025-07-19 06:59:10.542+0100
access_time 2025-07-19 06:52:17.504+0100
was created on 18 April this year, last modified a couple of minutes later, last had its attributes changed on 19 July, but was last read 7 minutes before that modification to its attributes.
These can be read using Precize, or in Terminal, but there’s a catch with access_time. APFS has an optional feature, set by volume, determining whether access_time is changed strictly. If that option is set, then every time a file is accessed, whether it’s modified or not, its access_time is updated. However, this defaults to only updating access_time if its current value is earlier than mod_time. I’m not aware of any current method to determine whether strict access_time is enabled for any APFS volume, and it isn’t shown in Disk Utility.
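These datestamps can also be read programmatically. Here is a sketch using Python’s os.stat; note that st_birthtime, the creation time, is only exposed on macOS and other BSD-derived systems, so it’s accessed defensively:

```python
import os
from datetime import datetime

def file_times(path: str) -> dict:
    """Return a file's datestamps as ISO strings, where available."""
    st = os.stat(path)
    times = {
        "mod_time": st.st_mtime,        # last modification of file data
        "change_time": st.st_ctime,     # last change to attributes/metadata
        "access_time": st.st_atime,     # last read
    }
    # create_time (birth time) is only available on macOS/BSD.
    if hasattr(st, "st_birthtime"):
        times["create_time"] = st.st_birthtime
    return {k: datetime.fromtimestamp(v).isoformat() for k, v in times.items()}
```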
mod_time can be changed when there has been no change in the file’s data, for example using the Terminal command touch. Any of the times can be altered directly, although that should be very unusual even in malware.
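The same effect as touch can be had from code. This sketch shows mod_time being moved while the file’s data, and hence its digest, stay unchanged:

```python
import hashlib
import os

def bump_mod_time(path: str, new_time: float) -> None:
    """Set a file's access and modification times without touching its data."""
    os.utime(path, (new_time, new_time))

def digest(path: str) -> str:
    """SHA-256 of the file's contents, unaffected by datestamp changes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```

After bump_mod_time, a backup utility that relies on mod_time would treat the file as changed, yet its digest proves the data are identical.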
Although attaching a digest to a file as an extended attribute will update its change_time, there are many other reasons for that being changed, including macOS adding or changing quarantine xattrs, the file’s ‘last used date’, and others.
Proposed strategy
- Tag folders and files whose data integrity you wish to manage.
- Back them up using a method that preserves those tags, such as Time Machine or copying to iCloud Drive.
- Periodically Check their tags to verify their integrity.
- As soon as possible after any have been intentionally modified and saved, Retag them to ensure their tags are maintained.
- In the event that any are found to have changed, and no longer match their tag, trace that change back in their backups.
Unlike automatic integrity-checking built into a file system, this will detect all unexpected changes, regardless of whether they are made through the file system, are intentional or unintentional, are malicious, or result from errors in storage media or transmission. Because only intentionally changed files are retagged, this also minimises the size of backups.
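The Tag and Check steps of that strategy can be sketched in code. This version stores digests in a sidecar file rather than an extended attribute, purely so it runs on any platform; the function names mirror the steps above, not Dintch’s actual implementation:

```python
import hashlib
import json
import os

TAG_FILE = ".digests.json"  # sidecar store; Dintch uses xattrs instead

def _digest(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def tag(folder):
    """Record a digest for every file in the folder (Tag, also used to Retag)."""
    tags = {name: _digest(os.path.join(folder, name))
            for name in os.listdir(folder)
            if name != TAG_FILE and os.path.isfile(os.path.join(folder, name))}
    with open(os.path.join(folder, TAG_FILE), "w") as f:
        json.dump(tags, f)

def check(folder):
    """Return the names of files whose contents no longer match their tag."""
    with open(os.path.join(folder, TAG_FILE)) as f:
        tags = json.load(f)
    return [name for name, d in tags.items()
            if _digest(os.path.join(folder, name)) != d]
```

Any name returned by check is a file to trace back through its backups; running tag again after an intentional edit corresponds to the Retag step.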