Quirks of Spotlight local search
Over the last few weeks, as I’ve been digging deeper into Spotlight local search, what seemed at first to be fairly straightforward has become far more complex. This article draws together some lessons that I have learned.
Apple’s search patents have been abandoned
Having tracked down a batch of Apple’s patents relating to search technologies that are likely to have been used in Spotlight, I was surprised to find that they have been abandoned. For example, US Patent 2011/0113052 A1, Query result iteration for multiple queries, filed by John Hörnkvist on 14 January 2011, was abandoned five years later. I therefore have no insights to offer based on Apple’s extant patents.
Search is determined by index structure
Spotlight indexes metadata separately from contents, and both types of index point to files, apparently through their paths and names, rather than their inodes. You can demonstrate this using test file H in SpotTest. Once Spotlight has indexed objects in images discovered by mediaanalysisd, moving that file to a different folder breaks that association immediately, and the same applies to file I, whose text is recognised by Live Text.
Extracted text, text recovered using optical character recognition (Live Text), and object labels obtained using image classification (Visual Look Up) are all treated as content rather than metadata. Thus you can search for content, but you can’t obtain a list of the objects that have been indexed from images, any more than you can obtain Spotlight’s lexicon of words extracted as text.
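By way of illustration, mdls IMG_0001.jpeg (the file name here is hypothetical) lists the metadata Spotlight holds for an image, but you won’t find recognised text or object labels among the attributes it reports, as those are held in the content index; they can only be reached by a content search of the kind described below.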
Language
Spotlight’s indexes are multilingual, as demonstrated by one of Apple’s earliest patents for search technology. Extracted text can thus contain words in several languages, but isn’t translated. Object labels are likely to be in the primary language set at the time, for example using the German word weide instead of the English cattle, if German was set when mediaanalysisd extracted object types from that image. You can verify this in SpotTest using test file H and custom search terms.
If you change your Mac’s primary language frequently, this could make it very hard to search for objects recognised in images.
Search method makes a difference
The Finder’s Find feature can be effective, but has a limited syntax lacking OR and NOT, unless you resort to using Raw Query predicates (available from the first popup menu). This means it can’t be used to construct a search for text containing cattle OR cow. This has a beneficial side-effect, in that each term used should reduce the number of hits, but it’s a significant constraint.
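If you do need an OR, a Raw Query should serve. As a sketch, using Spotlight’s query syntax with the same example terms, something like
(** == "cattle*"cdw) || (** == "cow*"cdw)
should match content containing either word.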
The Finder does support some search types not available in other methods such as mdfind. Of the image-related types, John reports that kMDItemPhotosSceneClassificationLabels can be used in the Finder’s Find and will return files with matching objects that have been identified, but that doesn’t work in mdfind, either in Terminal or when called by an app. Other promising candidates that have proved unsuccessful include:
- kMDItemPhotosSceneClassificationIdentifiers
- kMDItemPhotosSceneClassificationMediaTypes
- kMDItemPhotosSceneClassificationSynonyms
- kMDItemPhotosSceneClassificationTypes
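To check whether any of these attributes are even exposed for a given image, mdls can report one by name, for example (the file name is hypothetical)
mdls -name kMDItemPhotosSceneClassificationLabels IMG_0001.jpeg
which prints the attribute’s value, or (null) if it isn’t available for that file.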
One huge advantage of mdfind is that it can perform a general search for content using wildcards, in the form
(** == '[searchTerm]*'cdw)
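In full, that might be written as
mdfind "(** == 'cattle*'cdw)"
where the modifiers c, d and w make the match case-insensitive, insensitive to diacritics, and word-based.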
Using NSMetadataQuery from compiled code is probably not worth the effort. Not only does it use predicates of a different form from mdfind, but it’s unable to make use of wildcards in the same way that mdfind can, a lesson again demonstrated in SpotTest. mdfind can also be significantly quicker.
For example, you might use the form
mdfind "(kMDItemKeywords == '[searchTerm]*'cdw)"
in Terminal, or from within a compiled app. The equivalent predicate for NSMetadataQuery, here with cattle as the search term, would read
(kMDItemKeywords CONTAINS[cdw] \"cattle\")
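In Swift source, that might be constructed as (a sketch, writing the attribute name literally rather than using a key constant)
let predicate = NSPredicate(format: "kMDItemKeywords CONTAINS[cdw] \"cattle\"")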
Another caution when using NSMetadataQuery is that each app appears to have a single NSMetadataQuery instance on its main thread. That can allow results from one query to leak into those of another.
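If you still want to use NSMetadataQuery, here’s a minimal sketch of how a search might be run, stopping any previous query before starting the next to reduce the risk of that leakage. The names here are my own rather than SpotTest’s, and it assumes it runs on the main thread of a normal app with a run loop, and that the [cdw] options are accepted as in the predicate quoted above:

import Foundation

final class SpotlightSearcher: NSObject {
    private var query: NSMetadataQuery?

    func search(for term: String) {
        // Stop and discard any previous query first, so its results
        // can't be mistaken for those of the new search.
        if let old = query {
            old.stop()
            NotificationCenter.default.removeObserver(self, name: .NSMetadataQueryDidFinishGathering, object: old)
        }

        let q = NSMetadataQuery()
        // The same predicate as shown above, with the search term substituted in.
        q.predicate = NSPredicate(format: "kMDItemKeywords CONTAINS[cdw] %@", term)
        q.searchScopes = [NSMetadataQueryLocalComputerScope]

        NotificationCenter.default.addObserver(self,
            selector: #selector(didFinishGathering(_:)),
            name: .NSMetadataQueryDidFinishGathering,
            object: q)

        query = q
        q.start()    // needs a run loop, so call this on the main thread
    }

    @objc private func didFinishGathering(_ note: Notification) {
        // Ignore notifications from any query other than the current one.
        guard let q = note.object as? NSMetadataQuery, q === query else { return }
        q.disableUpdates()
        for i in 0..<q.resultCount {
            if let item = q.result(at: i) as? NSMetadataItem,
               let path = item.value(forAttribute: NSMetadataItemPathKey) as? String {
                print(path)
            }
        }
        q.stop()
    }
}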
Key points
- Spotlight indexes metadata separately from contents.
- Text recovered from images, and objects recognised in images, appear to be indexed as contents. As a result you can’t obtain a lexicon of object types.
- Indexed data appear to be associated with the file’s path, and that association is lost when a file is moved, even within the same volume.
- Text contents aren’t translated for indexing, so need to be searched for in their original language.
- Object types obtained from images appear to be indexed using terms from the primary language at the time they are indexed. If the primary language is changed, that will make it harder to search for images by contents.
- The Finder’s Find is constrained in its logic, and doesn’t support OR or NOT, although using Raw Queries can work around that. mdfind is most powerful, including wildcard search for content.
- NSMetadataQuery called from code uses a different predicate format and has limitations.
I’m very grateful to Jürgen for drawing my attention to the effects of language, and to John for reporting his discovery of kMDItemPhotosSceneClassificationLabels.