
Textovert 1.1 can now convert PDF files to other formats

As promised last week, I have now produced a new version of Textovert that can extract text from PDF files and convert that to any of the nine formats supported by the app. Testing here suggests this could be generally useful, as the quality of output files appears good, and well worth the small effort of conversion.

This new version offers the same conversions as the first, using textutil, but handles PDF files with a .pdf extension (case-insensitive) differently. When converting them to plain text, it loads the PDF and uses Quartz 2D’s PDF engine to extract the text for saving as a text file. When the output format is set to Rich Text (RTF), it uses the same engine to extract styled text and saves that as an RTF file. Note that doesn’t include layout information, but is generally a fairly faithful representation of the styles used in the original.

For the seven other output formats, Textovert first extracts styled text into a temporary RTF file, then hands that over to textutil to convert it to the selected output format.
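
Those three routes can be summarised in a minimal shell sketch. This only illustrates the decision described above, not Textovert's own code, which does this inside the app. The function name conversion_route and the route labels it prints are hypothetical.

```sh
# Hypothetical helper: report which route a file would take for a given
# output format. The route labels are invented for illustration only.
conversion_route() {
    file=$1
    format=$2
    # Lower-case the extension so .PDF and .Pdf are treated like .pdf
    ext=$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')
    if [ "$ext" != "pdf" ]; then
        echo "textutil"                      # not a PDF: ordinary textutil conversion
        return
    fi
    case $format in
        txt) echo "pdf-text" ;;              # extract plain text directly
        rtf) echo "pdf-rtf" ;;               # extract styled text, save as RTF
        *)   echo "pdf-rtf-then-textutil" ;; # temporary RTF, then textutil
    esac
}
```

For example, conversion_route Manual.PDF docx reports the two-stage route, while conversion_route notes.doc rtf reports a plain textutil conversion.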

Each PDF conversion is handled in a separate thread running at a high QoS in the background, to avoid blocking the main thread. As large conversions can take many seconds or even minutes to complete, Textovert’s window tracks how many are running at the moment. That’s most useful when converting batches of PDFs, when it’s easy to forget the last one or two that are still in progress.

Because each conversion gets its own thread, multiple simultaneous conversions will occupy as many CPU cores as are available, as shown in this CPU History for my seven heavyweight test PDFs. At the left of each chart the CPU % rises rapidly as all seven conversion threads are active. As those complete, the bursts of CPU activity diminish until they are from the single thread converting the largest of the PDFs.

Among those test PDFs are:

  • A 527-page book of 10.9 MB
  • A 5,754-page ISA reference of 14.7 MB
  • An 867-page book of 18 MB
  • A 141-page software manual of 24.4 MB
  • A 12,940-page reference manual of 76.6 MB, created using FrameMaker 2019 and Adobe Acrobat Distiller 23.0 on Windows, © 2013-2023 Arm Limited.

To give you an idea of the quality of output, this is a tiny excerpt of the last of those in its original PDF:

And this is the webarchive output from Textovert viewed in Safari:

Converting PDFs does require significantly more memory than conversions performed by textutil alone. For most documents of more modest size, 100-500 MB is usual, but my monster test PDFs usually rise toward 5 GB during their conversion. I have checked this version for memory leaks, and although it can hold onto some memory longer than I would have expected, that doesn’t continue to rise, and no leak is apparent.

Because PDF conversions are more intricate, I have added extensive error-reporting. For example, if you try to convert a PDF containing scanned images without any recognised text, that won’t have any recoverable text available, as will be reported in the main window. Once conversion is complete, Textovert tries to delete the intermediate RTF file from temporary storage, and if that fails, you’ll be warned.

Textovert version 1.1 for macOS 14.6 and later is now available from here: textovert11, from Downloads above, and from its Product Page.

I hope you find it useful.

Textovert 1.0: a convenient wrapper for text conversion

Yesterday I sang the praises of the little-known command tool textutil for converting text content between nine different formats. As I promised, today I offer a small wrapper app to make those conversions more convenient: Textovert.

textutil provides three features: general document information, text format conversion, and document concatenation. The first of those is probably best left to editors, and the last requires a document layout editor, so I chickened out of those for the time being. Textovert version 1.0 runs commands of the form
textutil -convert format filename1 -output filename2
where format is the format of the output file, filename1 is the input, and filename2 the output file.

You select the output format from the nine options in the window’s dropdown menu, before dropping any files onto that window. If you want to perform multiple conversions at the same time, you can open two or more windows and set each to its own output format.

Then drag and drop files to be converted onto the window. This version of Textovert only accepts files (and document bundles like RTFD), not folders, as they present several problems I’d rather not go into just yet. Textovert will then work through all the files one at a time and prompt you to select a filename and location for the converted file to be saved. For those converting just one or a handful of files at a time, this gives you fine control.

For those who have just dropped a batch of dozens of files onto the window, Textovert’s default behaviour is to save the converted files in the same location as the originals, with the same filename but the new extension. Thus, converting ~/Documents/Project/Meeting.doc to RTF will default to saving that converted file as ~/Documents/Project/Meeting.rtf. If you’re happy with that, you can click your way through saving each document without checking further.
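
That default naming rule is easy to reproduce in Terminal. The following is a minimal sketch rather than Textovert's own code. The helper names default_output_path and convert_in_place, and the TEXTUTIL variable used to swap in a stand-in command for dry runs, are all hypothetical. It also assumes that the format name doubles as the file extension, which may not hold for every format.

```sh
# Allow the tool to be swapped out, e.g. for a dry run (hypothetical convention).
TEXTUTIL=${TEXTUTIL:-textutil}

# Print the default output path: same folder and name, new extension.
default_output_path() {
    dir=$(dirname -- "$1")
    name=$(basename -- "$1")
    case $name in
        ?*.*) name=${name%.*} ;;   # drop the last extension; leave dotfiles alone
    esac
    printf '%s/%s.%s\n' "$dir" "$name" "$2"
}

# Convert one file to the given format, saving alongside the original.
convert_in_place() {
    "$TEXTUTIL" -convert "$2" "$1" -output "$(default_output_path "$1" "$2")"
}

# Usage (on a Mac): convert_in_place ~/Documents/Project/Meeting.doc rtf
```

Splitting the path with dirname and basename, rather than trimming everything after the last dot, avoids mangling paths where a folder name contains a dot but the file has no extension.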

As each converted file is saved, Textovert writes a simple one-line report to its window, giving the original filename, ✅ to mark success, and the converted file’s extension. You can select and copy those from its window if you want to keep a record.

That screenshot was taken during testing, and shows two unsuccessful conversions, marked with a red exclamation mark. Hopefully you won’t encounter any of those.

You should be able to convert pretty well any file, although how much text will be recovered depends on textutil’s skills, not mine. The app comes with its own short Help book, accessible through the Help menu, and provided separately as well. It requires a minimum of Sonoma 14.6 to support its SwiftUI interface.

Textovert 1.0 is now available from here: textovert10, from Downloads above, and from its Product Page.

Enjoy!

Convert text between file formats, including webarchives

QuickLook makes it easy to preview most files, and TextEdit will display the text content of many formats. There are times, though, when it’s more convenient to extract the text content and save it in a different format, for example turning a Safari Web Archive or Word document into Rich Text. Thankfully, there’s a tool to do that in Terminal, textutil.

textutil is one of the older command tools, and was introduced in Mac OS X 10.4 Tiger twenty years ago. Despite that, it remains one of the most underused in modern macOS. It works by tapping into the macOS text system, using any of the following nine formats:

  • plain text (txt)
  • HTML (html)
  • Rich Text, RTF (rtf)
  • RTFD (rtfd)
  • Microsoft Word .doc and .docx (doc, docx)
  • Wordprocessing Markup Language, WordML (wordml)
  • OpenDocument Text, ODT (odt)
  • Safari Web Archive, webarchive (webarchive).

The name given in parentheses is that used in these commands.

The quality of format conversions is high, essentially the same as you’ll see in Apple’s apps. For example, here’s an original Word .doc file:

and here is a conversion to RTF using textutil:

If the original file contains embedded images or other non-textual content, though, those aren’t included in the output.

Display information

This is the simplest option, used as
textutil -info filename
where filename is the path and file name.

This displays basic information about the file, including its word count, and any metadata.

Format conversion

This extracts the text content of a file in one of its supported formats, and writes that out in a different format, as in
textutil -convert rtf filename
where filename is the path and file name. The output file will then have its extension replaced appropriately, for example
textutil -convert rtf myfile.html
will create the file myfile.rtf containing a Rich Text representation of the HTML file myfile.html. If you want to create a different output file, use a command like
textutil -convert rtf filename -output filename2.rtf

Only text content is written to the output file.

Joining files

textutil’s other main feature is joining text-based files together into a single file. For example,
textutil -cat rtf -output filename.rtf -- file1.rtf file2.rtf file3.rtf
concatenates the three files file1.rtf, file2.rtf and file3.rtf into the single file filename.rtf in Rich Text format. You can also include implicit conversions such as
textutil -cat rtf -output filename.rtf -- file1.html file2.rtf file3.html
where the first and last input files are converted from HTML to RTF before being concatenated into the single output file filename.rtf. Note the -- before the list of input files consists of two hyphen characters, not a dash.

Further options

Advanced options detailed in man textutil and textutil -help include:

  • change text encoding from the default of Unicode UTF-8,
  • change font and size,
  • exclude HTML elements,
  • specify metadata.
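
As a rough guide to how those options look in use, here are three small wrapper functions. The function names and the TEXTUTIL variable (used to swap in a stand-in command for dry runs) are hypothetical. The option names are as I understand them from man textutil, so check them against the man page on your own Mac before relying on them.

```sh
# Allow the tool to be swapped out, e.g. for a dry run (hypothetical convention).
TEXTUTIL=${TEXTUTIL:-textutil}

# Plain text out, encoded as UTF-16 instead of the default UTF-8.
to_utf16_text() {
    "$TEXTUTIL" -convert txt -encoding UTF-16 "$1"
}

# Plain text in, RTF out, with a chosen font and point size.
to_styled_rtf() {
    "$TEXTUTIL" -convert rtf -font Helvetica -fontsize 14 "$1"
}

# HTML out, omitting span and font elements and setting the title metadata.
# The element list must reach textutil as a single argument, hence the quotes.
to_clean_html() {
    "$TEXTUTIL" -convert html -excludedelements '(span, font)' -title "$2" "$1"
}

# Usage (on a Mac): to_clean_html page.webarchive "My Page"
```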

In macOS Tahoe you may also encounter warnings relating to font availability and substitution.

Summary

  • textutil -info filename for information;
  • formats txt, html, rtf, rtfd, doc, docx, wordml, odt, webarchive;
  • textutil -convert rtf filename for conversion;
  • textutil -cat rtf -output filename.rtf -- file1.rtf file2.rtf file3.rtf for concatenation.

To make this accessible from the GUI, I am working on a wrapper app named Textovert.
