How do Live Text and Visual Look Up work now?
Live Text and Visual Look Up are relatively recent features in macOS, first appearing in Monterey nearly four years ago. As that immediately followed Apple's brief debacle over its now-abandoned intention to scan images for CSAM, most concerns have been over whether these features send details of images to Apple.
Although recent Intel Macs also support both these features, they don't have the special hardware to accelerate them, so they are far slower. For this walkthrough, I'll only present information from Apple silicon Macs, in particular the M4 Pro chip in a Mac mini.
Initiation
When an image is opened from disk, the VisionKit system starts up early, often within 0.1 second of the image starting to open. Its initial goal is to segment the image according to its content, and identify whether any part of it could provide text. If there is, then Live Text is run first so you can select and use that as quickly as possible.
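The same analysis can be driven from your own code through VisionKit's ImageAnalyzer, the framework's documented entry point. Here's a minimal sketch, assuming macOS 13 or later; the function name and image URL are mine, while the analysis types correspond to Live Text (.text) and Visual Look Up (.visualLookUp):

```swift
import Foundation
import ImageIO
import VisionKit

// A sketch of requesting the same analysis Preview triggers when an image opens.
@MainActor
func analyzeImage(at url: URL) async throws {
    let analyzer = ImageAnalyzer()
    // Ask for both Live Text and Visual Look Up results.
    let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
    let analysis = try await analyzer.analyze(imageAt: url,
                                              orientation: .up,
                                              configuration: configuration)
    // All text recognised in the image, assembled from its segments.
    print(analysis.transcript)
    // True when there are objects of interest for Visual Look Up.
    print(analysis.hasResults(for: .visualLookUp))
}
```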
In the log, the first mention of this analysis comes from com.apple.VisionKit announcing
Setting DDTypes: All, [private]
Cancelling all requests: [private]
Signpost Begin: "VKImageAnalyzerProcessRequestEvent"
It then adds a request to Mad Interface, that is the mediaanalysisd service, with a total method return time, and that's processed with another signpost (a regular log entry rather than a Signpost):
Signpost Begin: "VisionKit MAD Parse Request"
com.apple.mediaanalysis receives that request
[MADServicePublic] Received on-demand image processing request (CVPixelBuffer) with MADRequestID
then runs that
Running task VCPMADServiceImageProcessingTask
This is run at a high Quality of Service (QoS) of 25, termed userInitiated, used for tasks the user needs to complete to be able to use an app, and is scheduled to run in the foreground. Next, Espresso creates a plan for a neural network to perform segmentation analysis, and the Apple Neural Engine (ANE) is prepared to load and run that model. There's then a long series of log entries posted for the ANE detailing its preparations. As this proceeds, segmentation may be adjusted and the model run repeatedly, which can involve CoreML, TextRecognition, Espresso and the ANE, each contributing further long runs of log entries.
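For comparison, this is the same userInitiated tier an app would request for work the user is waiting on; the queue label here is arbitrary, and the printed constant is the raw QoS value of 25 seen in the log:

```swift
import Foundation

// userInitiated work is scheduled ahead of utility and background QoS.
let queue = DispatchQueue(label: "example.imageAnalysis", qos: .userInitiated)
queue.async {
    // image analysis work would be dispatched here
}
// QOS_CLASS_USER_INITIATED is the class reported in the log as QoS 25.
print(QOS_CLASS_USER_INITIATED.rawValue)  // 25
```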
Text recognition
With segmentation analysis looking promising for the successful recognition of text in the image, LanguageModeling prepares its model by loading linguistic data, marked by com.apple.LanguageModeling reporting
Creating CompositeLanguageModel ([private]) for locale(s) ([private]): [private]
NgramModel: Loaded language model: [private]
and the appearance of com.apple.Lexicon. An n-gram is an ordered sequence of symbols, that could range from letters to whole words, and the lexicon depends on the language locales. In my case, two locales are used, en (generic English) and en_US (US English).
At this point, mediaanalysisd declares the VCPMADVIDocumentRecognitionTask is complete, and runs a VCPMADVIVisualSearchGatingTask, involving com.apple.triald, TRIClient, PegasusKit and Espresso again. Another neural network is loaded into the ANE, and is run until the VCPMADVIVisualSearchGatingTask is complete and the text returned.
The next task is for the translationd service to perform any translations required. If that's not needed, VisionKit reports
Translation Check completed with result: NO, [private]
These steps may be repeated with other image segments until all probable text has been discovered and recognised. At that stage, the recognised text is fully accessible to the user.
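The nearest public equivalent to this stage is the Vision framework's text recognition request, in which language correction and locales are explicit settings. This is a minimal sketch assuming an image file on disk; the function name is mine, while the request and its properties are Vision's own:

```swift
import Foundation
import Vision

// Recognise text in an image file, using the language-model and lexicon
// correction described above so that whole words are returned.
func recognizeText(in url: URL) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate        // favour accuracy over speed
    request.recognitionLanguages = ["en-US"]    // locale-specific lexicon
    request.usesLanguageCorrection = true       // words, not fragmented characters

    let handler = VNImageRequestHandler(url: url, options: [:])
    try handler.perform([request])

    return (request.results ?? []).compactMap { $0.topCandidates(1).first?.string }
}
```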
Object recognition
Further image analysis is then undertaken on segments of interest, to support Visual Look Up. Successful completion of that phase is signalled in Preview by the addition of two small stars at the upper left of its ⓘ Info tool. That indicates objects of interest can be identified by clicking on that tool, and isn’t offered when only Live Text has been successful. VisionKit terms those small buttons Visual Search Hints, and remains paused until the user clicks on one.
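Other apps can offer the same hints by laying VisionKit's overlay view over their own image view. This is a minimal sketch assuming macOS 13 or later and an ImageAnalysis result obtained as in the earlier sketch; the function name is mine:

```swift
import AppKit
import VisionKit

// Attach Live Text selection and Visual Search Hints to an existing image view.
@MainActor
func attachOverlay(to imageView: NSImageView, with analysis: ImageAnalysis) {
    let overlay = ImageAnalysisOverlayView()
    overlay.frame = imageView.bounds
    overlay.autoresizingMask = [.width, .height]
    overlay.trackingImageView = imageView           // keeps hints aligned with the image
    overlay.preferredInteractionTypes = .automatic  // text selection and Visual Look Up
    overlay.analysis = analysis
    imageView.addSubview(overlay)
}
```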
Visual Search Hints
Clicking on a Visual Search Hint then engages the LookupViewService, and changes the state from Configured to Searching. VisionKit records another
Signpost Begin: "VisionKit MAD VisualSearch Request"
and submits a request to MAD for image processing to be performed. If necessary, the kernel then brings all P cores online in preparation, and the ANE is put to work with a new model.
PegasusKit then makes the first off-device connection for assistance in visual search:
Querying https: // api-glb-aeus2a.smoot.apple.com/apple.parsec.visualsearch.v2.VisualSearch/VisualSearch with request (requestId: [UUID]) : (headers: ["grpc-message-type": "apple.parsec.visualsearch.v2.VisualSearchRequest", "Content-Type": "application/grpc+proto", "grpc-encoding": "gzip", "grpc-accept-encoding": "gzip", "X-Apple-RequestId": "[UUID]", "grpc-timeout": "10S", "User-Agent": "PegasusKit/1 (Mac16,11; macOS 15.6 24G84) visualintelligence/1"]) [private]
When the visual search task completes, information about the object of interest is displayed in a floating window. Visual Look Up is then complete, and in the absence of any further demands, the ANE may be shut down to conserve energy, and any inactive CPU cluster likewise:
ANE0: power_off_hardware: Powering off... done
PE_cpu_power_disable>turning off power to cluster 2
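If you want to follow these entries on your own Mac, the unified log can also be read programmatically. This is a minimal sketch using OSLogStore, which may need to be run with admin privileges to see entries from other processes; the ten-minute window and the single subsystem are only examples:

```swift
import Foundation
import OSLog

// Fetch recent entries from com.apple.VisionKit in the local unified log.
func recentVisionKitEntries() throws -> [String] {
    let store = try OSLogStore.local()
    // Start ten minutes back from now.
    let position = store.position(date: Date().addingTimeInterval(-600))
    let predicate = NSPredicate(format: "subsystem == %@", "com.apple.VisionKit")
    let entries = try store.getEntries(at: position, matching: predicate)
    return entries.compactMap { ($0 as? OSLogEntryLog)?.composedMessage }
}
```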
Key points
- Image analysis starts shortly after the image is loaded.
- Central components are VisionKit and mediaanalysisd (MAD).
- In Apple silicon Macs, extensive use is made of the Apple Neural Engine throughout, for neural network modelling.
- Most if not all of this runs at a high QoS of 25 (userInitiated) and in the foreground, for performance.
- Segmentation analysis identifies areas that might contain recoverable text.
- Segments are then analysed, using language modelling for appropriate locales and languages, and lexicons, to return words rather than fragmented characters.
- When Live Text is ready, segments are then analysed for recognisable objects. When that’s complete, each is marked by a Visual Search Hint.
- Clicking on a Visual Search Hint initiates a network connection to provide information about that object, displayed in a floating window.