What happens to images when you disable Live Text?

For many of us, Live Text and Visual Look Up are real boons, making it simple to copy text from images, to perform Optical Character Recognition (OCR) on images of text, and to identify objects of interest. They are also used by the mediaanalysisd service to extract text and object identifications for indexing by Spotlight, making those image contents searchable. Although the latter can be disabled generally or for specific folders in Spotlight settings, there appears to be only one control over Live Text, and none for Visual Look Up. This article examines what that single control does, using log extracts obtained from a Mac mini M4 Pro running macOS Sequoia 15.6. It follows my recent article on how these features work.
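
If you want to pull similar log entries yourself from an app that calls VisionKit, a minimal Swift sketch using OSLogStore might look like the following. A process-scoped store only sees entries written by that process, so entries from the mediaanalysisd daemon itself won't appear there; the extracts quoted in this article were taken from the full unified log:
import Foundation
import OSLog

// A minimal sketch: read recent unified log entries written by this process
// for the VisionKit subsystem. mediaanalysisd writes from its own process,
// so its entries won't appear in a process-scoped store like this one.
func dumpVisionKitEntries(lastSeconds: TimeInterval = 60) throws {
    let store = try OSLogStore(scope: .currentProcessIdentifier)
    let start = store.position(date: Date().addingTimeInterval(-lastSeconds))
    for entry in try store.getEntries(at: start) {
        guard let log = entry as? OSLogEntryLog,
              log.subsystem == "com.apple.VisionKit" else { continue }
        print(log.date, log.composedMessage)
    }
}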

Setting

Live Text, which is enabled by default, can be disabled in Language & Region settings. When that is turned off, opening an image containing text in Preview no longer makes any text selectable. However, Visual Look Up still works as normal.
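
For comparison, VisionKit's public API offers third-party apps a similar choice: an ImageAnalyzer.Configuration determines whether text, Visual Look Up, or both are analysed. This is only a sketch of that public interface on macOS 13 and later, not a reconstruction of what Preview does:
import AppKit
import ImageIO
import VisionKit

// A sketch of VisionKit's public API: the analysis types requested here
// parallel the control described above, with .text for Live Text and
// .visualLookUp for Visual Look Up.
func analyse(_ image: NSImage) async throws {
    guard ImageAnalyzer.isSupported else { return }   // false on unsupported Macs
    let analyzer = ImageAnalyzer()
    let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
    let analysis = try await analyzer.analyze(image, orientation: .up, configuration: configuration)
    print("text found:", analysis.hasResults(for: .text))
    print("Visual Look Up candidates:", analysis.hasResults(for: .visualLookUp))
}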

Textual images

When an image containing recognisable text but no other objects is opened in Preview, the VisionKit subsystem is still activated soon after the image is loaded. VisionKit initially reports that the “device” supports analysis, but immediately clarifies that to
Device does not support image analysis, but does support Visual Search, limiting to just Visual Search.
It then starts a VKImageAnalyzerProcessRequestEvent with a MAD parse request. That leads to a Visual Search Gating Task being run, and the Apple Neural Engine (ANE) and all CPU P cores are prepared for that.

Less than 0.1 second later, the end of the VKImageAnalyzerProcessRequestEvent is reported, and VisionKit returns an analysis that no image segment merits further analysis. Preview’s ⓘ Info button remains in its normal state, and clicking on it doesn’t alter the image displayed.

Images with other objects

An image containing potentially recognisable objects doesn’t stop there. If VisionKit returns an analysis indicating Visual Search could extract objects from the image, the ⓘ Info button adds stars and waits for the user to open the Info window.

VisionKit reports in the log
Setting Active Interaction Types: [private], [private]
then that it
DidShowVisualSearchHints with invocationType: VisualSearchHintsActivated, id: 1
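
Those Active Interaction Types correspond to the interaction types a host app sets on VisionKit’s overlay view. A minimal sketch, assuming an AppKit image view as the host, and limiting interactions to Visual Look Up alone:
import AppKit
import VisionKit

// A sketch: restrict VisionKit's overlay to Visual Look Up interactions,
// the public equivalent of the Active Interaction Types noted in the log.
@MainActor
func addOverlay(to imageView: NSImageView, with analysis: ImageAnalysis) {
    let overlayView = ImageAnalysisOverlayView()
    overlayView.preferredInteractionTypes = [.visualLookUp]   // no text selection
    overlayView.analysis = analysis
    overlayView.trackingImageView = imageView
    overlayView.frame = imageView.bounds
    overlayView.autoresizingMask = [.width, .height]
    imageView.addSubview(overlayView)
}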

When one of those Visual Search Hints is clicked, the Lookup View is prepared, followed by a notice from LookupViewService that it’s Changing state from LVSDisplayStateConfigured to LVSDisplayStateSearching. That leads to VisionKit making a Visual Search request from mediaanalysisd.

After the Apple Neural Engine (ANE) is run to progress that, a successful search results in PegasusKit making its internet connection to identify the object, exactly as it does when Live Text is enabled and text has been recovered from the image:
Querying https: // api-glb-aeuw1b.smoot.apple.com/apple.parsec.visualsearch.v2.VisualSearch/VisualSearch with request (requestId: [UUID]) : (headers: ["Content-Type": "application/grpc+proto", "grpc-encoding": "gzip", "grpc-accept-encoding": "gzip", "grpc-message-type": "apple.parsec.visualsearch.v2.VisualSearchRequest", "X-Apple-RequestId": "[UUID]", "User-Agent": "PegasusKit/1 (Mac16,11; macOS 15.6 24G84) visualintelligence/1", "grpc-timeout": "10S"]) [private]

The ANE is finally cleaned up and shut down as the search results are displayed in the Lookup View.

Conclusions

  • When Live Text is disabled in Language & Region settings, images are still analysed when they are opened, to determine if they’re likely to contain objects that can be recognised in Visual Search for Visual Look Up.
  • If there are no such objects detected, VisionKit proceeds no further.
  • If there are suitable objects, mediaanalysisd and VisionKit proceed to identify them using Visual Search Hints, as normal.
  • If the user clicks on a Visual Search Hint, PegasusKit connects to Apple’s servers to identify that object and provide information for display in the Lookup View.
  • Although there is less extensive use of the ANE and CPU cores than when Live Text is enabled, neural networks are still run locally to perform a more limited image analysis.

I’m very grateful to Benjamin for pointing out this control over Live Text.

How do Live Text and Visual Look Up work now?

Live Text and Visual Look Up are recent features in macOS, first appearing in Monterey nearly four years ago. As that immediately followed Apple’s short debacle over its now-abandoned intention to scan images for CSAM, most concerns have been over whether these features send details of images to Apple.

Although recent Intel Macs also support both these features, they don’t have the special hardware to accelerate them, so they are far slower. For this walkthrough, I’ll only present information from Apple silicon Macs, in particular from the M4 Pro chip in a Mac mini.

Initiation

When an image is opened from disk, the VisionKit subsystem starts up early, often within 0.1 second of the image starting to open. Its initial goal is to segment the image according to its content, and identify whether there’s any part of it that could provide text. If there is, then Live Text is run first so you can select and use that text as quickly as possible.

In the log, first mention of this comes from com.apple.VisionKit announcing
Setting DDTypes: All, [private]
Cancelling all requests: [private]
Signpost Begin: "VKImageAnalyzerProcessRequestEvent"

VisionKit then adds a request to the MAD Interface, that is the mediaanalysisd service, reporting a total method return time, and that request is marked with another signpost (in this case a regular log entry rather than a true Signpost):
Signpost Begin: "VisionKit MAD Parse Request"

com.apple.mediaanalysis receives that request
[MADServicePublic] Received on-demand image processing request (CVPixelBuffer) with MADRequestID
then runs that
Running task VCPMADServiceImageProcessingTask

This is run at a high Quality of Service (QoS) of 25, termed userInitiated and intended for tasks the user needs completed to be able to use an app, and is scheduled to run in the foreground. Next, Espresso creates a plan for a neural network to perform segmentation analysis, and the Apple Neural Engine (ANE) is prepared to load and run that model, detailed in a long series of log entries for the ANE. As this proceeds, segmentation may be adjusted and the model run repeatedly, involving CoreML, TextRecognition, Espresso and the ANE, each of which can add to those long runs of log entries.
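
For reference, a QoS of 25 is the raw value of the userInitiated class; in an app’s own code, scheduling work at the same QoS looks like this purely illustrative sketch:
import Dispatch

// A sketch of work scheduled at the userInitiated QoS class, recorded in
// the log as QoS 25. Illustrative only; mediaanalysisd schedules its own tasks.
func runUserInitiated(_ work: @escaping () -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        work()   // work the user is actively waiting on
    }
}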

Text recognition

With segmentation analysis looking promising for the successful recognition of text in the image, LanguageModeling prepares its model by loading linguistic data, marked by com.apple.LanguageModeling reporting
Creating CompositeLanguageModel ([private]) for locale(s) ([private]): [private]
NgramModel: Loaded language model: [private]

and the appearance of com.apple.Lexicon. An n-gram is an ordered sequence of symbols that could range from letters to whole words, and the lexicon depends on the language locales. In my case, two locales are used, en (generic English) and en_US (US English).
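
Vision’s public text recognition uses the same ingredients, with locale-specific recognition languages and language-model correction to return words rather than fragmented characters. A minimal sketch, in which the image URL and handling of results are illustrative:
import Foundation
import Vision

// A sketch of text recognition using Vision's public API, with the same
// locales as those in the log. usesLanguageCorrection applies the language
// modelling step described above.
func recogniseText(at url: URL) throws {
    let request = VNRecognizeTextRequest { request, _ in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        for observation in observations {
            if let best = observation.topCandidates(1).first {
                print(best.string, best.confidence)
            }
        }
    }
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["en-US", "en"]
    request.usesLanguageCorrection = true
    let handler = VNImageRequestHandler(url: url, options: [:])
    try handler.perform([request])
}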

At this point, mediaanalysisd declares the VCPMADVIDocumentRecognitionTask is complete, and runs a VCPMADVIVisualSearchGatingTask, involving com.apple.triald, TRIClient, PegasusKit and Espresso again. Another neural network is loaded into the ANE, and is run until the VCPMADVIVisualSearchGatingTask is complete and text returned.

The next task is for the translationd service to perform any translations required. If that’s not needed, VisionKit reports
Translation Check completed with result: NO, [private]
These steps may be repeated with other image segments until all probable text has been discovered and recognised. At that stage, recognised text is fully accessible to the user.

Object recognition

Further image analysis is then undertaken on segments of interest, to support Visual Look Up. Successful completion of that phase is signalled in Preview by the addition of two small stars at the upper left of its ⓘ Info tool. That indicates objects of interest can be identified by clicking on that tool, and isn’t offered when only Live Text has been successful. VisionKit terms those small buttons Visual Search Hints, and remains paused until the user clicks on one.

Visual Search Hints

Clicking on a Visual Search Hint then engages the LookupViewService, and changes the state from Configured to Searching. VisionKit records another
Signpost Begin: "VisionKit MAD VisualSearch Request"
and submits a request to MAD for image processing to be performed. If necessary, the kernel then brings all P cores online in preparation, and the ANE is put to work with a new model.

PegasusKit then makes the first off-device connection for assistance in visual search:
Querying https: // api-glb-aeus2a.smoot.apple.com/apple.parsec.visualsearch.v2.VisualSearch/VisualSearch with request (requestId: [UUID]) : (headers: ["grpc-message-type": "apple.parsec.visualsearch.v2.VisualSearchRequest", "Content-Type": "application/grpc+proto", "grpc-encoding": "gzip", "grpc-accept-encoding": "gzip", "X-Apple-RequestId": "[UUID]", "grpc-timeout": "10S", "User-Agent": "PegasusKit/1 (Mac16,11; macOS 15.6 24G84) visualintelligence/1"]) [private]

When the visual search task completes, information is displayed about the object of interest in a floating window. Visual Look Up is then complete, and in the absence of any further demands, the ANE may be shut down to conserve energy, and any inactive CPU cluster likewise:
ANE0: power_off_hardware: Powering off... done
PE_cpu_power_disable>turning off power to cluster 2

Key points

  • Image analysis starts shortly after the image is loaded.
  • Central components are VisionKit and mediaanalysisd, MAD.
  • In Apple silicon Macs, extensive use is made of the Apple Neural Engine throughout, for neural network modelling.
  • Most if not all of this is run at a high QoS of 25 (userInitiated) and in the foreground, for performance.
  • Segmentation analysis identifies areas that might contain recoverable text.
  • Segments are then analysed, using language modelling for appropriate locales and languages, and lexicons, to return words rather than fragmented characters.
  • When Live Text is ready, segments are then analysed for recognisable objects. When that’s complete, each is marked by a Visual Search Hint.
  • Clicking on a Visual Search Hint initiates a network connection to provide information about that object, displayed in a floating window.
