Explainer: Text formats

textutil, and Textovert its wrapper, convert between nine different formats, most of them in widespread use for documents that are largely based on text. This article explains a little about each of them, and its sequel tomorrow looks at how PDF differs. In each case, I give an example file size for a document containing the words
This is a test.
in a total of 15 characters.
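As a rough sketch of how such a conversion might be scripted, the Python below builds a textutil command line for one of the nine formats. The function names are hypothetical and the file names are placeholders. It assumes only the -convert and -output options of textutil, and doesn't reflect how Textovert itself calls the tool.

```python
import subprocess
import sys

# The nine format names that textutil accepts for its -convert option.
TEXTUTIL_FORMATS = (
    "txt", "html", "rtf", "rtfd", "doc", "docx", "wordml", "odt", "webarchive",
)


def textutil_command(source, fmt, output=None):
    """Build the argument list to convert `source` to `fmt` (hypothetical helper)."""
    if fmt not in TEXTUTIL_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    command = ["textutil", "-convert", fmt]
    if output is not None:
        command += ["-output", output]
    command.append(source)
    return command


def convert(source, fmt, output=None):
    """Run textutil, which ships only with macOS, so refuse to run elsewhere."""
    if sys.platform != "darwin":
        raise RuntimeError("textutil is only available on macOS")
    subprocess.run(textutil_command(source, fmt, output), check=True)


print(textutil_command("test.txt", "rtf", "test.rtf"))
```

Building the command is kept separate from running it, so the arguments can be inspected on any platform before anything is executed.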

Plain text

Plain text files in macOS are conventionally encoded using Unicode UTF-8, so the example requires just 15 bytes, in hex
54 68 69 73 20 69 73 20 61 20 74 65 73 74 2e. That contains no font or layout information, just the raw content.
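Those bytes can be checked with a few lines of Python, which encode the example sentence as UTF-8 and print its length and hex representation.

```python
text = "This is a test."

# Every character here is in the ASCII range, so UTF-8 uses one byte for each.
data = text.encode("utf-8")
hex_bytes = data.hex(" ")

print(len(data))   # 15
print(hex_bytes)   # 54 68 69 73 20 69 73 20 61 20 74 65 73 74 2e
```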

Rich Text (RTF)

This was introduced, and its specification developed, by Microsoft during the late 1980s and 1990s for cross-platform interchange, primarily between its own products. Support for it in Mac OS X came in Cocoa and its rich text editor TextEdit, inherited from NeXTSTEP. The format contains two main groups of features: styled text with fonts, and simple layout that has been extended to include the embedding of images and other non-text content.

RTF files consist of text, originally ASCII but now with Unicode support. Although RTF isn’t actually a mark-up language, its source looks similar to one.

Each RTF file opens with the ‘magic’ characters {\rtf introducing information about conformity of the code. Following that is a preamble that is likely to contain platform-specific information, a font table and colour tables. The latter should include an expanded colour table for macOS. Then follows content, typically setting the font and size, with the paragraph content. The example file is 378 bytes in size.
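To make that structure concrete, here is a minimal sketch in Python that assembles a bare-bones RTF document: the opening {\rtf group, a font table, a colour table, then the content. The function name and default font are my assumptions. It handles only ASCII text, and omits the platform-specific preamble and expanded colour table that TextEdit writes, so its output is much shorter than the 378-byte example.

```python
def minimal_rtf(text, font="Helvetica", point_size=12):
    """Return a bare-bones RTF document for ASCII `text` (hypothetical helper)."""
    # Backslash and braces are reserved in RTF, so escape them in the content.
    escaped = text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")
    half_points = point_size * 2  # the \fs control word takes half-points
    return (
        "{\\rtf1\\ansi\\deff0\n"                        # magic characters, version, charset
        "{\\fonttbl{\\f0\\fswiss " + font + ";}}\n"     # font table with one entry
        "{\\colortbl;\\red0\\green0\\blue0;}\n"         # colour table with black
        "\\f0\\fs" + str(half_points) + " " + escaped + "\\par\n"  # font, size, paragraph
        "}\n"
    )


print(minimal_rtf("This is a test."))
```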

RTFD

RTF has several shortcomings, particularly in handling embedded images, so NeXTSTEP extended it to a bundle format, Rich Text Format Directory or RTFD, which carried over to Mac OS X. The RTF content of a document is stored in a file named TXT.rtf, alongside separate files containing scalable images that can include PDF, and the whole directory is treated as if it were a single file. Although this works well in macOS, it never caught on in Windows, so hasn’t achieved the popularity it deserved. As the example file doesn’t have any images, its size as RTFD is also 378 bytes.
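The bundle layout can be sketched in Python as a directory holding TXT.rtf and any attachments. The function name here is hypothetical, and this only illustrates the layout: in a real RTFD the RTF source must also refer to each attachment using control words that this sketch leaves out.

```python
import tempfile
from pathlib import Path


def write_rtfd(bundle_path, rtf_source, attachments=None):
    """Write an RTFD-style directory and return the names of its files."""
    bundle = Path(bundle_path)
    bundle.mkdir(parents=True, exist_ok=True)
    # The RTF content always goes into a file named TXT.rtf.
    (bundle / "TXT.rtf").write_text(rtf_source, encoding="ascii")
    # Attachments such as images are stored as separate files beside it.
    for name, data in (attachments or {}).items():
        (bundle / name).write_bytes(data)
    return sorted(p.name for p in bundle.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    names = write_rtfd(Path(tmp) / "Example.rtfd", "{\\rtf1\\ansi This is a test.}")
    print(names)
```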

Microsoft Word

From its inception in 1983 until it switched to docx, Microsoft Word’s native file format had the extension .doc. This is a binary format that has been successfully reverse-engineered for the open source OpenOffice and LibreOffice, and so incorporated into many products, including Cocoa and macOS.

From 2002, Microsoft Word has used a series of XML-based formats, since 2006 conforming to standards published first by Ecma then ISO/IEC, using the extension .docx, and known as Office Open XML. Support has been incorporated into macOS.

The .doc version of the example file requires 19 KB, while the .docx version takes only 4 KB.

HTML

This has evolved through a series of versions since its release in 1993, and is the markup language that dominates the web. Its structure should be well known, and consists of an opening document type declaration followed by tagged elements containing metadata and content. Support for writing HTML is built into the Cocoa HTML Writer in macOS. This uses CSS to define styles in the document head that are then applied to sections of the content, for example
<body>
<p class="p1">This is a test.</p>
</body>

The example requires only 538 bytes of HTML.
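As an illustration of that arrangement, the Python sketch below writes a small HTML document whose head defines a CSS rule for class p1, then parses it again to recover the paragraph and its class. The style rule and function names are my assumptions, and the output isn’t intended to match what the Cocoa HTML Writer produces.

```python
from html import escape
from html.parser import HTMLParser


def styled_html(text, css_class="p1", rule="margin: 0.0px; font: 12.0px Helvetica"):
    """Return an HTML document that styles one paragraph through a CSS class."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f'<style type="text/css">\np.{css_class} {{{rule}}}\n</style>\n'
        "</head>\n<body>\n"
        f'<p class="{css_class}">{escape(text)}</p>\n'
        "</body>\n</html>\n"
    )


class ParagraphCollector(HTMLParser):
    """Collect [class, text] pairs for each <p> element."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append([dict(attrs).get("class"), ""])

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text outside paragraphs, such as the style rules, is ignored.
        if self.in_paragraph:
            self.paragraphs[-1][1] += data


collector = ParagraphCollector()
collector.feed(styled_html("This is a test."))
collector.close()
print(collector.paragraphs)
```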

webarchive

This format is proprietary to Apple and its Safari browser, and when viewed in a capable text editor such as BBEdit, is shown as consisting of the serialised contents of a displayed web page, in XML format. In fact, as fds corrects me below, “a .webarchive is better described as a collection of web resources serialized via NSKeyedArchiver into binary plist format, bundled together into a single file in yet another property list, also saved in the binary property list format.”

When viewed in an editor such as BBEdit, after its opening XML and document type declaration as a property list, this consists of a dictionary of key-value pairs, themselves including sub-dictionaries of Web Resources. The content of each, its WebResourceData, is encoded in Base-64, making it impossible to read in a text editor. Although these can be large, for the example only 778 bytes of storage is required, showing the efficiency of the binary property list format.
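Python’s plistlib can show why the data appear as Base-64 in an editor. This sketch builds a stand-in archive as a binary property list, reads back its main resource, and re-serialises it as XML. The key names other than WebResourceData are those I understand Safari to use, the URL is a placeholder, and a genuine .webarchive will contain more than this.

```python
import plistlib


def build_webarchive(html, url="https://example.com/test.html"):
    """Serialise a stand-in web archive as a binary property list."""
    archive = {
        "WebMainResource": {
            "WebResourceData": html.encode("utf-8"),   # raw bytes of the page
            "WebResourceMIMEType": "text/html",
            "WebResourceTextEncodingName": "UTF-8",
            "WebResourceURL": url,                      # placeholder URL
        }
    }
    return plistlib.dumps(archive, fmt=plistlib.FMT_BINARY)


def main_resource_text(archive_bytes):
    """Decode the main resource of an archive back into text."""
    archive = plistlib.loads(archive_bytes)  # detects binary or XML format
    resource = archive["WebMainResource"]
    encoding = resource.get("WebResourceTextEncodingName", "utf-8")
    return resource["WebResourceData"].decode(encoding)


raw = build_webarchive("<p>This is a test.</p>")
print(raw[:8])                   # binary property lists open with b'bplist00'
print(main_resource_text(raw))

# Re-serialised as XML, the same bytes become a Base-64 <data> element.
as_xml = plistlib.dumps(plistlib.loads(raw), fmt=plistlib.FMT_XML)
print(as_xml.decode("utf-8"))
```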

WordML

Between the original .doc and the Ecma .docx formats, Microsoft Word used an intermediate WordprocessingML (or WordML) format in XML. After a standard XML header, this declares
<?mso-application progid="Word.Document"?>
followed by a list of schemas. Although of largely historical interest now, some old Word documents may remain in this format. The example file requires 1 KB of storage.

ODT

This is OpenDocument Text, another XML-based format that was developed around the same time as WordML, and is supported by many free apps and ‘office’ suites. Its opening structure is similar to that of WordML, but references OASIS and OpenDocument sources. The example here requires 2 KB of storage.

Pages

One significant omission from the list of text formats supported by textutil is that used by Apple’s own Pages. This proprietary format changed significantly in 2009. Currently, a .pages document is a zipped bundle containing thumbnail JPEG previews of the document, and two folders of files. Content appears to be saved in Apple iWork Archive files with the .iwa extension, and is quite unlike RTFD.

Podofyllin 1.5 beta can export PDF to RTF

When I was developing my PDF reader Podofyllin, one of my goals was for it to be able to export from PDF to Rich Text Format. I never managed to get that to work, but in the light of your comments about supporting PDF as one of the convertible formats in Textovert, I revisited Podofyllin yesterday with the aim of adding that feature. And to my amazement it seems to work.

Before implementing any PDF conversions in Textovert, I’d be very grateful if you could test and comment on a beta of Podofyllin, which does now export PDF to RTF. If that code does a good enough job, then in the coming couple of weeks I will add PDF as one of the supported formats in Textovert, although that won’t rely so much on textutil as on my own code, and Quartz 2D support in macOS.

Podofyllin version 1.5 beta, build 38, adds a new command to its File menu to Export Rich Text. It also has printing disabled, as that has stopped working in recent versions of macOS, and needs repairs. I have disabled its update checking mechanism, so you won’t be pestered to ‘upgrade’ to version 1.4 when using this beta. Otherwise it retains all the features of version 1.4, and still has that Help book.

From my initial testing here the only significant oddity with the RTFs it writes may be small font sizes. There may be the occasional inappropriate use of a font, such as a line set in Courier in the midst of a paragraph in Helvetica, but those should be straightforward to correct. For small font sizes, I have simply selected all and used the Bigger font command to enlarge them all.

Podofyllin 1.5 beta, build 38, is now available from here: podofyllin15b
but not from anywhere else, for the time being. It requires macOS 11.5 or later.

Making the big presumption that PDF to RTF conversion proves worthwhile, this would make it possible to include PDF as one of the supported formats in Textovert. The snag is that this would require the whole of any PDF document to be read into memory before it could be converted to another format, in contrast to other formats, where I suspect that textutil streams the input file during conversion. I don’t think there is any way to do that with PDF, because of its complex data structure.

So, if you think that this beta’s conversion of PDF to plain text and RTF is good enough to be useful, please let me know whether you want it built into Textovert, together with PDF to the other supported formats, or left in Podofyllin.

I wish you all a happy Thanksgiving, and thank you for your friendship, contributions and engagement.
