
Textovert 1.1 can now convert PDF files to other formats

As promised last week, I have now produced a new version of Textovert that can extract text from PDF files and convert that to any of the nine formats supported by the app. Testing here suggests this could be generally useful, as the quality of output files appears good, and worth the small effort in conversion.

This new version offers the same conversions as the first, using textutil, but handles PDF files with a .pdf extension (case-insensitive) differently. When converting them to plain text, it loads the PDF and uses Quartz 2D’s PDF engine to extract the text for saving as a text file. When the output format is set to Rich Text (RTF), it uses the same engine to extract styled text and saves that as an RTF file. Note that the RTF doesn’t include layout information, but is generally a fairly faithful representation of the styles used in the original.

For the seven other output formats, Textovert first extracts styled text into a temporary RTF file, then hands that over to textutil to convert it to the selected output format.
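The routing described above can be sketched in a few lines. This is only an illustration in Python, not Textovert's own code: the function names and the step labels are hypothetical, and the format identifiers are the nine accepted by textutil's -convert option.

```python
from pathlib import PurePosixPath
from typing import List

# The nine output formats accepted by `textutil -convert`.
TEXTUTIL_FORMATS = ("txt", "html", "rtf", "rtfd", "doc", "docx",
                    "wordml", "odt", "webarchive")


def is_pdf(path: str) -> bool:
    """True when the filename ends in .pdf, ignoring case."""
    return PurePosixPath(path).suffix.lower() == ".pdf"


def plan_conversion(path: str, fmt: str) -> List[str]:
    """Return the steps (hypothetical labels) needed to convert path to fmt."""
    if fmt not in TEXTUTIL_FORMATS:
        raise ValueError(f"unsupported output format: {fmt}")
    if not is_pdf(path):
        # Everything that isn't a PDF goes straight to textutil.
        return ["textutil"]
    if fmt == "txt":
        return ["extract plain text"]
    if fmt == "rtf":
        return ["extract styled text"]
    # The seven other formats go via a temporary RTF file.
    return ["extract styled text to temporary RTF", "textutil"]
```

For example, plan_conversion("Manual.PDF", "docx") returns the two-step route through a temporary RTF file, while plan_conversion("Meeting.doc", "docx") returns the single textutil step.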

Each PDF conversion is handled in a separate thread running at a high QoS in the background, to avoid blocking the main thread. As large conversions can take many seconds or even minutes to complete, Textovert’s window tracks how many are running at the moment. That’s most useful when converting batches of PDFs, when it’s easy to forget the last one or two that are still in progress.
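As a rough illustration of that bookkeeping, here is a minimal sketch in Python of one thread per conversion with a count of those still running. The ConversionTracker class is hypothetical, and the sketch doesn't attempt to reproduce QoS settings or true multi-core behaviour, only the counting.

```python
import threading


class ConversionTracker:
    """Runs each job in its own thread and counts those still in progress."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = 0

    @property
    def running(self) -> int:
        """Number of conversions that have started but not yet finished."""
        with self._lock:
            return self._running

    def start(self, work, *args) -> threading.Thread:
        """Start work(*args) in a new thread and return that thread."""
        def wrapper():
            try:
                work(*args)
            finally:
                # Decrement even if the conversion raised an error.
                with self._lock:
                    self._running -= 1

        # Count the job before its thread starts, so the total never lags.
        with self._lock:
            self._running += 1
        thread = threading.Thread(target=wrapper, daemon=True)
        thread.start()
        return thread
```

A window could then poll tracker.running to display how many conversions remain, which is what matters when the last one or two of a batch are still grinding away.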

Because each conversion gets its own thread, multiple simultaneous conversions will occupy as many CPU cores as are available, as shown in this CPU History for my seven heavyweight test PDFs. At the left of each chart the CPU % rises rapidly as all seven conversion threads are active. As those complete, the bursts of CPU activity diminish until they are from the single thread converting the largest of the PDFs.

Among those test PDFs are:

  • A 527-page book of 10.9 MB
  • A 5,754-page ISA reference of 14.7 MB
  • An 867-page book of 18 MB
  • A 141-page software manual of 24.4 MB
  • A 12,940-page reference manual of 76.6 MB, created using FrameMaker 2019 and Adobe Acrobat Distiller 23.0 on Windows, © 2013-2023 Arm Limited.

To give you an idea of the quality of output, this is a tiny excerpt of the last of those in its original PDF:

And this is the webarchive output from Textovert viewed in Safari:

Converting PDFs does require significantly more memory than conversions performed by textutil alone. For most documents of more modest size, 100-500 MB is usual, but my monster test PDFs usually rise toward 5 GB during their conversion. I have checked this version for memory leaks, and although it can hold onto some memory longer than I would have expected, its memory use doesn’t continue to rise, and no leak is apparent.

Because PDF conversions are more intricate, I have added extensive error-reporting. For example, if you try to convert a PDF containing scanned images without any recognised text, there won’t be any recoverable text available, and that will be reported in the main window. Once conversion is complete, Textovert tries to delete the intermediate RTF file from temporary storage, and if that fails, you’ll be warned.
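The clean-up step at the end could look something like this. It is a sketch in Python under my own assumptions: both function names and the wording of the warning are hypothetical, and it shows only the pattern of attempting the deletion and reporting a failure rather than ignoring it.

```python
import os
import tempfile
from typing import Optional


def make_intermediate_rtf() -> str:
    """Create an empty temporary .rtf file and return its path."""
    handle, path = tempfile.mkstemp(suffix=".rtf")
    os.close(handle)
    return path


def remove_intermediate(path: str) -> Optional[str]:
    """Delete the temporary file; return a warning message if that fails."""
    try:
        os.remove(path)
    except OSError as error:
        return f"Warning: could not delete temporary file {path}: {error}"
    return None
```

Returning the warning as a string, rather than raising, lets the caller add it to the same report as any other conversion messages.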

Textovert version 1.1 for macOS 14.6 and later is now available from here: textovert11, from Downloads above, and from its Product Page.

I hope you find it useful.

Podofyllin 1.5 beta can export PDF to RTF

When I was developing my PDF reader Podofyllin, one of my goals was for it to be able to export from PDF to Rich Text Format. I never managed to get that to work, but in the light of your comments about supporting PDF as one of the convertible formats in Textovert, I revisited Podofyllin yesterday with the aim of adding that feature. And to my amazement it seems to work.

Before implementing any PDF conversions in Textovert, I’d be very grateful if you could test and comment on a beta of Podofyllin, which does now export PDF to RTF. If that code does a good enough job, then in the coming couple of weeks I will add PDF as one of the supported formats in Textovert, although that won’t rely so much on textutil as on my own code, and Quartz 2D support in macOS.

Podofyllin version 1.5 beta, build 38, adds a new command to its File menu to Export Rich Text. It also has printing disabled, as that has stopped working in recent versions of macOS, and needs repairs. I have disabled its update checking mechanism, so you won’t be pestered to ‘upgrade’ to version 1.4 when using this beta. Otherwise it retains all the features of version 1.4, and still has that Help book.

From my initial testing here the only significant oddity with the RTFs it writes may be small font sizes. There may be the occasional inappropriate use of a font, such as a line set in Courier in the midst of a paragraph in Helvetica, but those should be straightforward to correct. For small font sizes, I have simply selected all and used the Bigger font command to enlarge them all.

Podofyllin 1.5 beta, build 38, is now available from here: podofyllin15b, but not from anywhere else for the time being. It requires macOS 11.5 or later.

Making the big presumption that PDF to RTF conversion proves worthwhile, this would make it possible to include PDF as one of the supported formats in Textovert. The snag is that would require the whole of any PDF document to be read into memory, before it could be converted to another format, in contrast to other formats where I suspect that textutil streams the input file during conversion. I don’t think there is any way to do that with PDF, because of its complex data structure.

So, if you think that this beta’s conversion of PDF to plain text and RTF is good enough to be useful, please let me know whether you want it built into Textovert, together with PDF to the other supported formats, or left in Podofyllin.

I wish you all a happy Thanksgiving, and thank you for your friendship, contributions and engagement.

Textovert 1.0: a convenient wrapper for text conversion

Yesterday I sang the praises of the little-known command tool textutil for converting text content between nine different formats. As I promised, today I offer a small wrapper app to make those conversions more convenient: Textovert.

textutil provides three features: general document information, text format conversion, and document concatenation. The first of those is probably best left to editors, and the last requires a document layout editor, so I chickened out of those for the time being. Textovert version 1.0 runs commands of the form
textutil -convert format filename1 -output filename2
where format is the format of the output file, filename1 is the input, and filename2 the output file.
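To make the shape of that command concrete, here is a small Python sketch that builds the same argument list and, on macOS only, runs it. The function names are hypothetical, and this isn't how Textovert itself is written.

```python
import subprocess
import sys
from typing import List


def textutil_command(fmt: str, source: str, destination: str) -> List[str]:
    """Build: textutil -convert fmt source -output destination"""
    return ["textutil", "-convert", fmt, source, "-output", destination]


def convert(fmt: str, source: str, destination: str) -> bool:
    """Run textutil and report whether it exited successfully (macOS only)."""
    if sys.platform != "darwin":
        raise RuntimeError("textutil is only available on macOS")
    result = subprocess.run(textutil_command(fmt, source, destination),
                            capture_output=True, text=True)
    return result.returncode == 0
```

Passing the arguments as a list, rather than as a single shell string, means filenames containing spaces or quotes need no escaping.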

You select the output format from the nine options in the window’s dropdown menu, before dropping any files onto that window. If you want to perform multiple conversions at the same time, you can open two or more windows and set each to its own output format.

Then drag and drop files to be converted onto the window. This version of Textovert only accepts files (and document bundles like RTFD), not folders, as they present several problems I’d rather not go into just yet. Textovert will then work through all the files one at a time and prompt you to select a filename and location for the converted file to be saved. For those converting just one or a handful of files at a time, this gives you fine control.

For those who have just dropped a batch of dozens of files onto the window, Textovert’s default behaviour is to save the converted files in the same location as the originals, with the same filename but the new extension. Thus, converting ~/Documents/Project/Meeting.doc to RTF will default to saving that converted file as ~/Documents/Project/Meeting.rtf. If you’re happy with that, you can click your way through saving each document without checking further.
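That default naming rule is simple to express. A minimal sketch in Python, with a hypothetical function name, and taking the new extension as a parameter rather than assuming how each format maps to an extension:

```python
from pathlib import PurePosixPath


def default_output_path(source: str, extension: str) -> str:
    """Same folder and filename as source, with the extension replaced."""
    return str(PurePosixPath(source).with_suffix("." + extension))
```

Given ~/Documents/Project/Meeting.doc and the extension rtf, this yields ~/Documents/Project/Meeting.rtf, matching the example above. Only the final extension is replaced, so a name such as notes.v2.doc would become notes.v2.rtf.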

As each converted file is saved, Textovert writes a simple one-line report to its window, giving the original filename, ✅ to mark success, and the converted file’s extension. You can select and copy those from its window if you want to keep a record.

That screenshot was taken during testing, and shows two unsuccessful conversions, marked with a red exclamation mark. Hopefully you won’t encounter any of those.

You should be able to convert pretty well any file, although how much text will be recovered depends on textutil’s skills, not mine. The app comes with its own short Help book, accessible through the Help menu, and provided separately as well. It requires a minimum of Sonoma 14.6 to support its SwiftUI interface.

Textovert 1.0 is now available from here: textovert10, from Downloads above, and from its Product Page.

Enjoy!
