Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

Textovert 1.1 can now convert PDF files to other formats

By: hoakley
1 December 2025 at 15:30

As promised last week, I have now produced a new version of Textovert that can extract text from PDF files and convert that to any of the nine formats supported by the app. Testing here suggests this could be generally useful, as the quality of output files appears good, and worth the small effort in conversion.

This new version offers the same conversions as the first, using textutil, but handles PDF files with a .pdf extension (case-insensitive) differently. When converting them to plain text, it loads the PDF and uses Quartz 2D’s PDF engine to extract the text for saving as a text file. When the output format is set to Rich Text (RTF), it uses the same engine to extract styled text and saves that as an RTF file. Note that doesn’t include layout information, but is generally a fairly faithful representation of the styles used in the original.

For the seven other output formats, Textovert first extracts styled text into a temporary RTF file, then hands that over to textutil to convert it to the selected output format.

Each PDF conversion is handled in a separate thread running at a high QoS in the background, to avoid blocking the main thread. As large conversions can take many seconds or even minutes to complete, Textovert’s window tracks how many are running at the moment. That’s most useful when converting batches of PDFs, when it’s easy to forget the last one or two that are still in progress.

Because each conversion gets its own thread, multiple simultaneous conversions will occupy as many CPU cores as are available, as shown in this CPU History for my seven heavyweight test PDFs. At the left of each chart the CPU % rises rapidly as all seven conversion threads are active. As those complete, the bursts of CPU activity diminish until they are from the single thread converting the largest of the PDFs.

Among those test PDFs are:

  • A 527-page book of 10.9 MB
  • A 5,754-page ISA reference of 14.7 MB
  • An 867-page book of 18 MB
  • A 141-page software manual of 24.4 MB
  • A 12,940-page reference manual created using FrameMaker 2019 and Adobe Acrobat Distiller 23.0 on Windows of 76.6 MB size, © 2013-2023 Arm Limited.

To give you an idea of the quality of output, this is a tiny excerpt of the last of those in its original PDF:

And this is the webarchive output from Textovert viewed in Safari:

Converting PDFs does require significantly more memory than those performed by textutil alone. For most documents of more modest size, 100-500 MB is usual, but my monster test PDFs usually rise toward 5 GB during their conversion. I have checked this version for memory leaks, and although it can hold onto some memory longer than I would have expected, that doesn’t continue to rise, and no leak is apparent.

Because PDF conversions are more intricate, I have added extensive error-reporting. For example, if you try to convert a PDF containing scanned images without any recognised text, that won’t have any recoverable text available, as will be reported in the main window. Once conversion is complete, Textovert tries to delete the intermediate RTF file from temporary storage, and if that fails, you’ll be warned.

Textovert version 1.1 for macOS 14.6 and later is now available from here: textovert11
from Downloads above, and from its Product Page.

I hope you find it useful.

Explainer: Text formats

By: hoakley
28 November 2025 at 15:30

textutil, and Textovert its wrapper, convert between nine different formats, most of them in widespread use for documents that are largely based on text. This article explains a little about each of them, and its sequel tomorrow looks at how PDF differs. In each case, I give an example file size for a document containing the words
This is a test.
in a total of 15 characters.

Plain text

Conventionally, plain text files in macOS are most usually encoded using Unicode UTF-8, requiring just 15 bytes for the hex bytes
54 68 69 73 20 69 73 20 61 20 74 65 73 74 2e. Of course that contains no font or layout information, just the raw content.

Rich Text (RTF)

This was introduced and its specification developed by Microsoft during the late 1980s and 90s, for cross-platform interchange, primarily between its own products. Support for this in Mac OS X came in Cocoa and its rich text editor TextEdit, inherited from NeXTSTEP. The format contains two main groups of features, styled text with fonts, and simple layout that has been extended to include the embedding of images and other non-text content.

RTF files consist of text, originally ASCII but now with Unicode support. Although not actually a mark-up language, its source code appears similar.

Each RTF file opens with the ‘magic’ characters {\rtf introducing information about conformity of the code. Following that is a preamble that is likely to contain platform-specific information, a font table and colour tables. The latter should include an expanded colour table for macOS. Then follows content, typically setting the font and size, with the paragraph content. For the example file, size is 378 bytes.

RTFD

RTF has several shortcomings, particularly in handling embedded images, so NeXTSTEP extended it to a bundle format, Rich Text Format Directory, RTFD, that transferred to Mac OS X. RTF content of a document is stored in a file named TXT.rtf, alongside separate files containing scalable images that can include PDF, and the whole directory is treated as if it was a single file. Although this works well in macOS, it never caught on in Windows, so hasn’t achieved the popularity it deserved. As the example file doesn’t have any images, its size as RTFD is also 378 bytes.

Microsoft Word

From its inception in 1983 until it switched to docx, Microsoft Word’s native file format has had the extension .doc. This is a binary format that has been successfully reversed for OpenOffice and LibreOffice open source, so incorporated into many products, including Cocoa and macOS.

From 2002, Microsoft Word has used a series of XML-based formats, since 2006 conforming to standards published first by Ecma then ISO/IEC, using the extension .docx, and known as Office Open XML. Support has been incorporated into macOS.

The .doc version of the example file requires 19 KB, while the .docx version takes only 4 KB.

HTML

This has evolved through a series of versions since its release in 1993, and is the markup language that dominates the web. Its structure should be well-known, and consists of an opening document type declaration followed by tagged elements containing metadata and content. Support for writing HTML is built into the Cocoa HTML Writer in macOS. This uses CSS to define styles in the header that are then applied to sections of the content, for example
<body>
<p class="p1">This is a test.</p>
</body>

The example requires only 538 bytes of HTML.

webarchive

This format is proprietary to Apple and its Safari browser, and when viewed in a capable text editor such as BBEdit, is shown as consisting of the serialised contents of a displayed web page, in XML format. In fact, as fds corrects me below, “a .webarchive is better described as a collection of web resources serialized via NSKeyedArchiver into binary plist format, bundled together into a single file in yet another property list, also saved in the binary property list format.”

When viewed in an editor such as BBEdit, after its opening XML and document type declaration as a property list, this consists of a dictionary of key-value pairs, themselves including sub-dictionaries of Web Resources. The content of each, its WebResourceData, is encoded in Base-64, making it impossible to read in a text editor. Although these can be large, for the example only 778 bytes of storage is required, showing the efficiency of the binary property list format.

WordML

Between the original .doc and the Ecma .docx formats, Microsoft Word used an intermediate WordProcessingML (or WordML) format in XML. After a standard XML header, this declares
<?mso-application progid=”Word.Document”?>
followed by a list of schemas. Although of largely historical interest now, some old Word documents may remain in this format. The example file requires 1 KB of storage.

ODT

This is OpenDocument Text, another XML-based format that was developed around the same time as WordML, and supported by many free apps and ‘office’ suites. Its opening structure is similar to that of WordML, but references oasis and OpenDocument sources. The example here requires 2 KB of storage.

Pages

One significant omission from the list of text formats supported by textutil is that used by Apple’s own Pages. This proprietary format changed significantly in 2009. Currently, a .pages document is a Zipped bundle containing thumbnail JPEG previews of the document, and two folders of files. Content appears to be saved in Apple iWork Archive files with the .iwa extension, and quite unlike RTFD.

Podofyllin 1.5 beta can export PDF to RTF

By: hoakley
27 November 2025 at 15:30

When I was developing my PDF reader Podofyllin, one of my goals was for it to be able to export from PDF to Rich Text Format. I never managed to get that to work, but in the light of your comments about supporting PDF as one of the convertible formats in Textovert, I revisited Podofyllin yesterday with the aim of adding that feature. And to my amazement it seems to work.

Before implementing any PDF conversions in Textovert, I’d be very grateful if you could test and comment on a beta of Podofyllin, which does now export PDF to RTF. If that code does a good enough job, then in the coming couple of weeks I will add PDF as one of the supported formats in Textovert, although that won’t rely so much on textutil as on my own code, and Quartz 2D support in macOS.

Podofyllin version 1.5 beta, build 38, adds a new command to its File menu to Export Rich Text. It also has printing disabled, as that has stopped working in recent versions of macOS, and needs repairs. I have disabled its update checking mechanism, so you won’t be pestered to ‘upgrade’ to version 1.4 when using this beta. Otherwise it retains all the features of version 1.4, and still has that Help book.

From my initial testing here the only significant oddity with the RTFs it writes may be small font sizes. There may be the occasional inappropriate use of a font, such as a line set in Courier in the midst of a paragraph in Helvetica, but those should be straightforward to correct. For small font sizes, I have simply selected all and used the Bigger font command to enlarge them all.

Podofyllin 1.5 beta, build 38, is now available from here: podofyllin15b
but not from anywhere else, for the time being. It requires macOS 11.5 or later.

Making the big presumption that PDF to RTF conversion proves worthwhile, this would make it possible to include PDF as one of the supported formats in Textovert. The snag is that would require the whole of any PDF document to be read into memory, before it could be converted to another format, in contrast to other formats where I suspect that textutil streams the input file during conversion. I don’t think there is any way to do that with PDF, because of its complex data structure.

So, if you think that this beta’s conversion of PDF to plain text and RTF is good enough to be useful, please let me know whether you want it built into Textovert, together with PDF to the other supported formats, or left in Podofyllin.

I wish you all a happy Thanksgiving, and thank you for your friendship, contributions and engagement.

Textovert 1.0: a convenient wrapper for text conversion

By: hoakley
25 November 2025 at 15:30

Yesterday I sang the praises of the little-known command tool textutil for converting text content between nine different formats. As I promised, today I offer a small wrapper app to make those conversions more convenient: Textovert.

textutil provides three features: general document information, text format conversion, and document concatenation. The first of those is probably best left to editors, and the last requires a document layout editor, so I chickened out of those for the time being. Textovert version 1.0 runs commands of the form
textutil -convert format filename1 -output filename2
where format is the format of the output file, filename1 is the input, and filename2 the output file.

You select the output format from the nine options in the window’s dropdown menu, before dropping any files onto that window. If you want to perform multiple conversions at the same time, you can open two or more windows and set each to its own output format.

Then drag and drop files to be converted onto the window. This version of Textovert only accepts files (and document bundles like RTFD), not folders, as they present several problems I’d rather not go into just yet. Textovert will then work through all the files one at a time and prompt you to select a filename and location for the converted file to be saved. For those converting just one or a handful of files at a time, this gives you fine control.

For those who have just dropped a batch of dozens of files onto the window, Textovert’s default behaviour is to save the converted files in the same location as the originals, with the same filename but the new extension. Thus, converting ~/Documents/Project/Meeting.doc to RTF will default to saving that converted file as ~/Documents/Project/Meeting.rtf. If you’re happy with that, you can click your way through saving each document without checking further.

As each converted file is saved, Textovert writes a simple one-line report to its window, giving the original filename, ✅ to mark success, and the converted file’s extension. You can select and copy those from its window if you want to keep a record.

That screenshot was taken during testing, and shows two unsuccessful conversions, marked with a red exclamation mark. Hopefully you won’t encounter any of those.

You should be able to convert pretty well any file, although how much text will be recovered depends on textutil‘s skills, not mine. The app comes with its own short Help book, accessible through the Help menu, and provided separately as well. It requires a minimum of Sonoma 14.6 to support its SwiftUI interface.

Textovert 1.0 is now available from here: textovert10
from Downloads above, and from its Product Page.

Enjoy!

Convert text between file formats, including webarchives

By: hoakley
24 November 2025 at 15:30

QuickLook makes it easy to preview most files, and TextEdit will display the text content of many formats. There are times, though, when it’s more convenient to extract the text content and save it in a different format, for example turning a Safari Web Archive or Word document into Rich Text. Thankfully, there’s a tool to do that in Terminal, textutil.

textutil is one of the older command tools, and was introduced in Mac OS X 10.4 Tiger twenty years ago. Despite that, it remains one of the most underused in modern macOS. It works by tapping into the macOS text system, using any of the following nine formats:

  • plain text (txt)
  • HTML (html)
  • Rich Text, RTF (rtf)
  • RTFD (rtfd)
  • Microsoft Word .doc and .docx (doc, docx)
  • Wordprocessing Markup Language, WordML (wordml)
  • OpenDocument Text, ODT (odt)
  • Safari Web Archive, webarchive (webarchive).

The name given in parentheses is that used in these commands.

The quality of format conversions is high, essentially the same as you’ll see in Apple’s apps. For example, here’s an original Word .doc file:

and here is a conversion to RTF using textutil:

If the original file contains embedded images or other non-textual content, though, those aren’t included in the output.

Display information

This is the simplest option, used as
textutil -info filename
where filename is the path and file name.

This displays basic information about the file, including its word count, and any metadata.

Format conversion

This extracts the text content of a file in one of its supported formats, and writes that out in a different format, as in
textutil -convert rtf filename
where filename is the path and file name. The output file will then have its extension replaced appropriately, for example
textutil -convert rtf myfile.html
will create the file myfile.rtf containing a Rich Text representation of the HTML file myfile.html. If you want to create a different output file, use a command like
textutil -convert rtf filename -output filename2.rtf

Only text content is written to the output file.

Joining files

textutil‘s other main feature is joining text-based files together to form a single file consisting of the input files concatenated together, as in
textutil -cat rtf -output filename.rtf -- file1.rtf file2.rtf file3.rtf
concatenates the three files file1.rtf file2.rtf file3.rtf into the single file filename.rtf in Rich Text format. You can also include implicit conversions such as
textutil -cat rtf -output filename.rtf -- file1.html file2.rtf file3.html
where the first and last parts of the single output file filename.rtf are converted to RTF before concatenation. Note the -- before the list of input files consists of two hyphen characters, not a dash.

Further options

Advanced options detailed in man textutil and textutil -help include:

  • change text encoding from the default of Unicode UTF-8,
  • change font and size,
  • exclude HTML elements,
  • specify metadata.

In macOS Tahoe you may also encounter warnings relating to font availability and substitution.

Summary

  • textutil -info filename for information;
  • formats txt, html, rtf, rtfd, doc, docx, wordml, odt, webarchive;
  • textutil -convert rtf filename for conversion;
  • textutil -cat rtf -output filename.rtf -- file1.rtf file2.rtf file3.rtf for concatenation.

To make this accessible from the GUI, I am working on a wrapper app named Textovert.

A Spotlight bug affecting all recent macOS: the LG error

By: hoakley
23 October 2025 at 14:30

There’s a bug in Spotlight that can prevent it from indexing any of the contents of susceptible text files. This has been present since macOS 13 Ventura if not before, and is still present in Tahoe 26.0.1. I didn’t discover this myself, though: it was reported to me by Jürgen, to whom full credit is due. It’s also one of the strangest bugs I’ve come across, and all depends on two letters.

Demonstration

To demonstrate this bug, all you need is a single UTF-8 plain text file, created by TextEdit or any other app capable of saving plain text. Start the text with the two characters L and G, both in capitals. Then add one or more distinctive words, such as
LG syzygy

Save that file to a folder that you know is indexed and searched by Spotlight, then a few seconds later try searching for the word syzygy in its contents. Extend this as much as you want, maybe appending the whole of one of Charles Dickens’ novels, but no matter how you search for its contents, that file will never be found. If you want to get more serious, use that text file in my Spotlight test app SpotTest, and it will also be unable to find that file.

This only works with plain text files, not Rich Text, PDF or HTML. It’s also sensitive to those two letters. Set one of them in lowercase, preface them with a space, or substitute a different letter, and the contents of that file will then be indexed correctly and searchable as normal.

Affected macOS

I have tested this in virtual machines going back as far as macOS 13 Ventura, and it’s present in them all. If you have access to an earlier version of macOS, I’d be interested to know whether it affects that as well.

Cause

The two UTF-8 characters concerned, 4c 47, don’t appear to be anything special that could be misinterpreted.

Although it’s not easy to distinguish failure to index from search errors, saving a test file does result in repeated reports of an error that could cause Spotlight to fail when trying to index the file, for example the log entries
30.946740 mdwrite Decoding error: Error Domain=NSCocoaErrorDomain Code=4864 UserInfo={NSDebugDescription=[private]} for [private]
30.951004 mds Decoding error: Error Domain=NSCocoaErrorDomain Code=4864 UserInfo={NSDebugDescription=[private]} for [private]

Error code 4864 is NSCoderReadCorruptError, implying that the presence of those two characters at the start of a text file may be triggering a bug in RichText.mdimporter, the importer module shipped in macOS that’s responsible for indexing plain text files.

My current hypothesis is therefore that text files starting with the characters LG are failing to have their contents indexed correctly because of a bug in RichText.mdimporter.

History

This isn’t the first bug in the RichText.mdimporter. In macOS Catalina 10.15.6, the same mdimporter (then build 319.60.100) introduced a bug that broke indexing of Rich Text (RTF) files. That was perpetuated through early releases of Big Sur until it was finally fixed in RichText.mdimporter build 326.11 in Big Sur 11.3.

Because text files starting with the characters LG are exceedingly unusual, this bug appears to have been left in RichText.mdimporter for a great deal longer.

I will be reporting this to Apple in Feedback later this month. Please feel free to file your own Feedback if you can spare the time.

Summary

  • In macOS 13 to 26, plain text files starting with the characters LG cannot be searched for their contents.
  • This appears to be the result of a longstanding bug in RichText.mdimporter in macOS.
  • If those characters are altered, or prefixed by a space, indexing and search behave normally.

I’m very grateful to Jürgen for drawing this to my attention.

❌
❌