Explainer: PDF format

By: hoakley

29 November 2025 at 16:00

Yesterday’s explainer covered a range of text formats, but stopped short of one of the most popular formats for text documents, Adobe’s Portable Document Format, PDF. Its origins are as old as the Mac, and it hasn’t changed much since the start of this century, so PDF is very different from the more recent file formats, and from its antecedent PostScript.

PostScript

PostScript files, with the extension .ps, start with a prologue containing metadata such as
%!PS-Adobe-3.0
%%Title: c:\output\online.dvi
%%Creator: DVIPSONE 0.8 1991 Nov 30 16:22:12 SN 102
%%CreationDate: 1992 Mar 26 10:04:36

They then largely consist of dictionaries of PostScript programs, instructions that are to be used to construct the page being described, such as
%%Page: 3 4
dvidict begin bp % [3]
38811402 d U -34996224 d u -1582039 d U 29614244 r f2(3)s O o 34996224 d u -34340864 d u 8708260 r(of)s 185088 W(abstractions,)s 191757 X(such)S(as)S(\\mixed)S(blessing")S(and)S(\\retaliation,")T(in)S (semantic)S(nets)S(that)s o
and so on. These place each item of text and graphics on that page. PDF is completely different in that it consists of a tree of objects, sometimes many hundreds or thousands of them.

PDF

To be recognised as a PDF file, the first line must start with the ‘magic’ characters and give the version used:
%PDF-1.3
followed by a short line of non-ASCII bytes. It may appear surprising that macOS still writes a version that was defined in 1999, when the most recent versions are 1.7 (published by Adobe in 2006 and adopted as the ISO 32000-1 standard in 2008) and 2.0 (developed by ISO as ISO 32000-2), and the Quartz 2D PDF engine may also report version 1.4. At least these older versions ensure wide compatibility.
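As a rough illustration of that header check, the sketch below reads the version from the start of a file's bytes. The function name and the sample bytes are made up for this example, and it assumes the header sits at the very first byte, which real PDF readers don't always insist on.

```python
import re

def pdf_header_version(data: bytes):
    """Return the version from a header such as b'%PDF-1.3', or None if absent."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None

# Hypothetical sample: the header line, then a comment line of high-bit bytes.
sample = b"%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4\xc6\n"
version = pdf_header_version(sample)
print(version)
```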

Then follows the main data, as a series of objects arranged in a flattened tree structure, starting like
3 0 obj
<< /Filter /FlateDecode /Length 158 >>
[…stream length 158…]
with a binary stream of data, which is here compressed using the Flate method (zlib/deflate compression, which has largely replaced the older LZW filter in PDF), terminated by
endobj
which defines object number 3.
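To see what such a stream holds, it can be inflated using zlib. This is only a sketch: first_stream() is a made-up helper, the sample object and its page content are invented, and it assumes /Length is given as a direct number and that FlateDecode is the only filter applied.

```python
import re
import zlib

def first_stream(data: bytes) -> bytes:
    """Return the inflated bytes of the first stream found in data."""
    # Assumes a direct integer /Length and a single /FlateDecode filter.
    match = re.search(rb"/Length (\d+)[^>]*>>\s*stream\r?\n", data)
    if match is None:
        raise ValueError("no stream found")
    length = int(match.group(1))
    raw = data[match.end():match.end() + length]
    return zlib.decompress(raw)

# Build a hypothetical object 3 holding compressed page content.
content = b"BT /F1 12 Tf 72 770 Td (Hello) Tj ET"
packed = zlib.compress(content)
obj = (b"3 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(packed)
       + packed + b"\nendstream\nendobj\n")
inflated = first_stream(obj)
print(inflated)
```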

Some objects consist of code or definitions, such as
1 0 obj
<< /Type /Page /Parent 2 0 R /Resources 4 0 R /Contents 3 0 R /MediaBox [0 0 595 842] >>
endobj
which is a Page dictionary.
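The /MediaBox entry in a Page dictionary like this gives the page rectangle in PDF's default units of 1/72 inch, so the size can be worked out with a little arithmetic. The helper below is a made-up sketch that assumes no /UserUnit scaling is in force.

```python
import re

POINTS_PER_INCH = 72
MM_PER_INCH = 25.4

def media_box_mm(page_dict: str):
    """Return (width, height) in millimetres from a /MediaBox entry."""
    number = r"([-\d.]+)"
    pattern = r"/MediaBox\s*\[\s*" + r"\s+".join([number] * 4) + r"\s*\]"
    match = re.search(pattern, page_dict)
    if match is None:
        raise ValueError("no /MediaBox entry")
    x0, y0, x1, y1 = map(float, match.groups())
    to_mm = MM_PER_INCH / POINTS_PER_INCH
    return ((x1 - x0) * to_mm, (y1 - y0) * to_mm)

page = "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] >>"
width, height = media_box_mm(page)
print(round(width), round(height))
```

Rounded to the nearest millimetre, that page is 210 by 297, which is A4.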

Somewhere towards the end of the file, you’re likely to find an object containing metadata, such as details of the PDF engine that built the file:
11 0 obj
<< /Title (Untitled)
/Producer (macOS Version 26.1 \(Build 25B78\) Quartz PDFContext)
/Creator (DelightEd)
/CreationDate (D:20251126211410Z00'00')
/ModDate (D:20251126211410Z00'00')
>>
endobj
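The dates in that dictionary use PDF's own text format, D: followed by the year, month, day, hours, minutes and seconds, then a time zone. A minimal sketch of turning one into a Python datetime follows. parse_pdf_date() is a made-up name, and it assumes every field down to the seconds is present, although the format allows later fields to be omitted.

```python
import re
from datetime import datetime, timedelta, timezone

PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"   # date and time fields
    r"([Z+-])(?:(\d{2})'(\d{2})'?)?"                  # zone sign and offset
)

def parse_pdf_date(text: str) -> datetime:
    """Parse a full PDF date string such as D:20251126211410Z00'00'."""
    match = PDF_DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported PDF date: {text!r}")
    fields = match.groups()
    year, month, day, hour, minute, second = (int(f) for f in fields[:6])
    sign, off_hours, off_minutes = fields[6:]
    offset = timedelta(hours=int(off_hours or 0), minutes=int(off_minutes or 0))
    if sign == "-":
        offset = -offset
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))

created = parse_pdf_date("D:20251126211410Z00'00'")
print(created.isoformat())
```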

Right at the end of the PDF file comes the cross reference, which starts like
xref
0 12
0000000000 65535 f
0000000252 00000 n
and ends with a trailer
trailer
<< /Size 12 /Root 8 0 R /Info 11 0 R /ID [ ] >>
startxref
8785
%%EOF
and that EOF marker ends the PDF file.
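Reading a PDF therefore starts from the end: the number after startxref is the byte offset of the cross-reference table, and each entry in that table gives the byte offset of one object. The sketch below follows that route for a tiny, made-up file. read_xref() is an illustrative name, and it only handles the classic table shown here, not the compressed cross-reference streams that later PDF versions can use.

```python
import re

def read_xref(data: bytes):
    """Return {object number: (offset, generation, in_use)} from a classic xref table."""
    # The number after the last 'startxref' is the offset of the 'xref' keyword.
    start = int(re.findall(rb"startxref\s+(\d+)", data)[-1])
    lines = data[start:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("not a classic cross-reference table")
    entries = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        # Each subsection starts with 'first count', then one line per object.
        first, count = map(int, lines[i].split())
        for n in range(count):
            offset, generation, kind = lines[i + 1 + n].split()
            entries[first + n] = (int(offset), int(generation), kind == b"n")
        i += 1 + count
    return entries

# A hypothetical two-entry file assembled in memory.
body = b"%PDF-1.3\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
offset_1 = body.index(b"1 0 obj")
sample = (body
          + b"xref\n0 2\n"
          + b"0000000000 65535 f \n"
          + b"%010d 00000 n \n" % offset_1
          + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
          + b"startxref\n%d\n%%%%EOF\n" % len(body))
table = read_xref(sample)
print(table)
```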

Problems

Objects, as elements on the page, can be laid out almost randomly, something that often makes converting laid-out columns of text so infuriating. PDF can just drop in blocks of text and images in whatever order they come, which often doesn’t coincide with the original flow in the text. As a PDF file proceeds one page at a time, multiple columns laid out over several pages can be particularly disastrous to extract as text, or to reconstitute in any other way.

PDF files are extremely verbose, and their contents are now largely unreadable due to the extensive use of binary streams of data, and all their supporting information. A document containing a single character may thus result in a PDF file of 160 lines, making even expansive XML files look concise in comparison. The example file used in yesterday’s article takes 9 KB in storage, for a total of only 11 PDF objects.

When a PDF file is changed by annotation, the contents of each annotation are added to the file as further objects. To save apps from having to rewrite the whole PDF file every time a change is made, changes can be appended to the end of the main file contents. Those can then be incorporated into the body by rewriting the file in ‘flattened’ form.
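Each appended update ends with its own cross-reference section and %%EOF marker, so counting those markers gives a rough idea of how many times a file has been saved in this way. The sketch below is only a heuristic, as the same bytes could occur by chance inside a stream. revision_ends() is a made-up name, and the two byte strings are placeholders rather than valid PDF.

```python
import re

def revision_ends(data: bytes):
    """Return the byte offset just past each %%EOF marker, one per saved revision."""
    return [match.end() for match in re.finditer(rb"%%EOF", data)]

# Placeholder revisions: an original file, then an appended annotation update.
original = b"%PDF-1.3\n...objects...\nstartxref\n123\n%%EOF\n"
update = b"12 0 obj\n<< /Type /Annot >>\nendobj\n...xref...\nstartxref\n456\n%%EOF\n"
ends = revision_ends(original + update)
print(len(ends))
```

Truncating the file at the first of those offsets would recover the original revision in this example.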

It’s also important to remember how old the roots of PDF are. The first volume of the Unicode standard 1.0 wasn’t published until 1991, and its introduction into Mac OS was long delayed after that. Consequently, PDF remains based on 8-bit extended ASCII text, with the main characters in a PDF file still being original 7-bit ASCII. Handling characters is generally accomplished by specifying individual characters in a specific font. This is why font substitution in PDF documents so commonly results in incorrect characters being displayed, with characters outside the extended ASCII set being most vulnerable. In worst cases, this mojibake can render entire documents incomprehensible.
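The effect is easy to reproduce outside PDF. In the sketch below, Python's mac_roman and cp1252 codecs stand in for two 8-bit font encodings. They approximate, but are not identical to, the MacRomanEncoding and WinAnsiEncoding defined for PDF, so treat this as an analogy rather than a model of any particular PDF engine.

```python
# One byte per character, as in an 8-bit font encoding.
codes = "café".encode("mac_roman")

# The same bytes read through the wrong table give the wrong glyph.
as_written = codes.decode("mac_roman")
as_misread = codes.decode("cp1252")

print(codes)
print(as_written)
print(as_misread)
```

The plain ASCII letters survive, but the accented character comes out as Ž, as its code of 0x8E means something different in each table.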

Does Preview write PDF/A?

The Eclectic Light Company

By: hoakley

21 November 2025 at 15:30

Earlier this week, when I considered how best to save websites using Safari, I pointed out that the PDFs it saves aren’t intended to be in archival format, using one of the PDF/A standards. As some of you pointed out, Preview has an option to export PDFs in “PDF/A” format. This article examines whether those are suitable for archiving.

PDF/A

PDF is a generic document type and includes a multiplicity of different standards. Standard PDF generated by the Quartz 2D engine should comply with PDF version 1.4, from 2001, although the first open ISO standard of 2008 was based on version 1.7, and the current ISO standard is version 2.0. There are also five specialised subsets of PDF, among them PDF/A intended for archival purposes, each with their own families of ISO standards.

PDF/A was originally based on PDF version 1.4, but more recently has adopted 1.7. Its standards impose additional restrictions on core features, such as requiring all fonts to be embedded, and forbidding the use of encryption and LZW compression. Its standards are based on three levels of conformance: basic (B), accessible (A), and full Unicode text (U). The two standards and levels in most common use are PDF/A-2A (accessible) and PDF/A-2B (basic). A more detailed account is given in Wikipedia’s article.

Although Preview claims to export PDF documents in PDF/A format, I’ve been unable to discover which standard or level those are intended to comply with. However, each of the test documents is reported by Adobe Acrobat CC (Pro) as claiming compliance with PDF/A-2B in ISO 19005-2.
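A PDF/A file declares its claimed standard in its XMP metadata, using the pdfaid:part and pdfaid:conformance properties. The sketch below looks for those in the raw bytes of a file. claimed_pdfa() is a made-up helper, it assumes the XMP packet is stored uncompressed, and it reads only what the file claims, without testing whether that claim is true.

```python
import re

def claimed_pdfa(data: bytes):
    """Return e.g. 'PDF/A-2B' if the XMP metadata claims a PDF/A level, else None."""
    def field(name: bytes):
        # XMP may write a property as an attribute (name="v") or element (<name>v</name>).
        pattern = rb"pdfaid:" + name + rb"\s*(?:=\s*[\"']|>)\s*([^\"'<\s]+)"
        match = re.search(pattern, data)
        return match.group(1).decode("ascii", "replace") if match else None

    part = field(b"part")
    level = field(b"conformance")
    if part is None:
        return None
    return f"PDF/A-{part}{(level or '').upper()}"

# Hypothetical XMP fragments in the two forms.
as_attributes = (b'<rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" '
                 b'pdfaid:part="2" pdfaid:conformance="B"/>')
as_elements = b"<pdfaid:part>2</pdfaid:part><pdfaid:conformance>B</pdfaid:conformance>"
print(claimed_pdfa(as_attributes), claimed_pdfa(as_elements))
```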

Conformance

Three test PDF documents were used, two saved from Safari 26.1 (macOS 26.1) as detailed previously, and the Help book for LogUI, written by Nisus Writer Pro. All three were opened in Preview 11.0 (1113.2.5) and Exported As PDF/A, with just the Create PDF/A option ticked in the File Save dialog.

All three exported PDFs were then opened in Adobe Acrobat ‘Pro’ version 2025.001.20841. Acrobat reported that each claimed “compliance with the PDF/A standard”, and so opened them read-only to ensure they couldn’t be modified. When each was verified against PDF/A-2B, though, verification failed.

Details of the compliance failures were then obtained using Acrobat’s Preflight feature. In each there were multiple errors, such as those shown below.

To assess what changes were required to make the LogUI help book compliant with the standard, Acrobat then performed the conversion. Corrections it made are shown below.

Although those were quick and simple, without them the file exported from Preview wasn’t considered by Acrobat to comply with the standard.

When do you need to use PDF/A?

Although I’m confident that PDF documents created using the engine in Quartz 2D in macOS Tahoe will remain fully accessible for at least the next 20 years, looking 50 or 100 years ahead the use of a major open standard intended and widely used for archives becomes more important. Whether the imperfect PDF/A exported by Preview would make any difference to that is unclear.

If you intend any PDF documents created on Macs to be true archives that should stand the test of time, then you should convert them into PDF/A-2B or another appropriate standard before committing them to archival storage. Otherwise, it’s moot whether Preview’s conversion is a good investment of your time.

Summary

According to Adobe Acrobat, the ‘PDF/A’ format exported by Preview doesn’t comply with its claimed standard of PDF/A-2B. Thus the answer to the question posed by the title is no, not quite.
If a PDF is intended to be accessible for decades into the future, it should be converted to a recognised PDF/A standard such as PDF/A-2B using Adobe Acrobat or an equivalent.
Other PDFs may as well be left in their original format, which should ensure their accessibility for at least the next 20 years.