Reading view

There are new articles available, click to refresh the page.

Last Week on My Mac: Why Spotlight can’t find some files

For the last seven years or so there have been many folk complaining that Spotlight local search hasn’t been finding the files they know are there. Many have resorted to repeatedly rebuilding its indexes, usually without success. Last week, thanks to Jürgen, Drew, aldous and others who have contributed, we have discovered one cause. A bug that appears to have been introduced in macOS Mojave, and is still present in Tahoe 26.0.1, that prevents Spotlight from indexing any of the contents of plain text files that start with certain characters.

Jürgen stumbled across the first example, with files starting with the two capital letters LG. At the time, that seemed extremely unusual and unlikely to affect many files. Then Drew added HPA and Draw to the list of forbidden characters. What looked like a rare event was becoming increasingly commonplace, and that list can only grow. How many indexing failures it could account for is impossible to guess.

Piecing together the evidence, it looks like this bug is inside the standard macOS RichText.mdimporter, now embedded in the Signed System Volume in /System/Library/Spotlight and at version 6.9 (350), as it has been since Sonoma (Ventura 13.7.8 has build 345.60.106, although that also suffers this bug). What happens is that saving a text file starting with forbidden characters correctly triggers Spotlight’s indexing service. That identifies the file as having the UTI public.plain-text and hands it over for its contents to be indexed. But the indexer inspects those first few characters, decides it’s a different type of file altogether, and promptly returns an error 4864 for an NSCoderReadCorruptError without going any further.

Apart from the text content not being added to Spotlight’s indexes, and a few lines buried in the Unified log, there are no indications of anything going wrong. If you test the importer using
mdimport -t -d3 filename
the file appears to import correctly, but that command doesn’t give any insight into the import of its contents, only standard attributes such as the filename that are indexed separately.

It was Drew who first suggested a plausible reason for this failure, confirmed by aldous: prior to attempting to index the text contents, Spotlight’s service was using a completely different method to check the type of the contents, the ‘magic’ database used by the file(1) command.

file(1) is an old Unix utility dating back to 1973 or earlier, operating independently of UTIs that were adopted in Mac OS X 10.4 Tiger 20 years ago. Rather than relying on a type assigned to a file, it ‘sniffs’ the contents, particularly the first few bytes of data, and uses a sprawling set of ad hoc rules to guess the file type. It turns out that files starting with the characters Draw were characteristic of a binary vector graphics format used by the !Draw app for RISC OS 2 in 1989. Rather than believing the file’s UTI for one of the most common types of files in macOS, Spotlight’s indexer therefore decided that it was trying to import file data that must now be as rare as hens’ teeth, and wouldn’t go any further.

If you’re sceptical about this coincidence, open the acorn magic data in /usr/share/file/magic in a text editor, and you’ll see the file opening string of Draw identified as RISC OS Draw file data. There are 332 other magic data files containing similar rules for identifying file types. I leave it as an exercise to the Unix wizard to build a list of all those that could cause similar problems with Spotlight indexing.

When this bug hunt started and it affected just LG and HPA, it was fairly esoteric and faintly amusing, at least as long as you didn’t write about your LG TV, high pressure air or Horizontal Pod Autoscaling. When Draw was added, and all those 333 magic files piled in, I realised how extensive this could be, and how little testing can be performed on Spotlight indexing and search.

Given that about eight years ago an Apple engineer wrote code for the RichText.mdimporter in macOS that introduced testing against some or all of the magic database, wouldn’t you have thought they’d test and debug that against test cases, such as text files starting with characters (mis)recognised by magic rules? And maybe occasionally over subsequent years and new versions of macOS, wouldn’t revised versions of the importer be tested again?

Apple likes Spotlight to be opaque to the user, for it to ‘just work’. There’s almost no documentation even for developers, and tools provided are strictly limited in what they can do, as demonstrated here in the case of mdimport. That’s all very well until Spotlight doesn’t work and no one outside Apple can do anything about it. Third-parties can’t even write custom mdimporters to do the job properly, as those bundled in macOS take priority.

If this was the first time that Spotlight indexing had let us down, I might feel more charitable. But between macOS Catalina 10.15.6 in July 2020 and Big Sur 11.3 in April 2021 macOS was incapable of indexing the content of any Rich Text files. There are still many documents that haven’t been indexed as a result. Those whose contents haven’t been indexed as a result of this bug will similarly be excluded from search until they too are reindexed by a fixed mdimporter. For Intel Macs that won’t be supported by macOS 27, that could well be forever.

❌