Explainer: PDF format
Yesterday’s explainer covered a range of text formats, but stopped short of one of the most popular formats for text documents, Adobe’s Portable Document Format, PDF. Its origins are as old as the Mac, and it hasn’t changed much since the start of this century, so PDF is very different from the more recent file formats, and from its antecedent PostScript.
PostScript
PostScript files, with the extension .ps, start with a prologue containing metadata such as%!PS-Adobe-3.0
%%Title: c:\output\online.dvi
%%Creator: DVIPSONE 0.8 1991 Nov 30 16:22:12 SN 102
%%CreationDate: 1992 Mar 26 10:04:36
They then largely consist of dictionaries of PostScript programs, instructions that are to be used to construct the page being described, such as%%Page: 3 4
dvidict begin bp % [3]
38811402 d U
-34996224 d u
-1582039 d U
29614244 r
f2(3)s O o
34996224 d u
-34340864 d u
8708260 r(of)s
185088 W(abstractions,)s
191757 X(such)S(as)S(\\mixed)S(blessing")S(and)S(\\retaliation,")T(in)S
(semantic)S(nets)S(that)s o
and so on. These place each item of text and graphics on that page. PDF is completely different in that it consists of a tree of objects, sometimes many hundreds or thousands of them.
To be recognised as a PDF file, the first line must start with the ‘magic’ characters and give the version used:%PDF-1.3
followed by a short line of non-ASCII bytes. It may appear surprising that macOS still writes a version that was defined in 1999, when the current version is 1.7 (before ISO standardisation) or 2.0 (standardised), and the Quartz 2D PDF engine may also report version 1.4. At least these ensure wide compatibility.
Then follows the main data, as a series of objects arranged in a flattened tree structure, starting like3 0 obj
<< /Filter /FlateDecode /Length 158 >>
[…stream length 158…]
with a binary stream of data, which is here compressed using the Flate method (an improvement on LZW), terminated byendobj
which defines object number 3.
Some objects consist of code or definitions, such as1 0 obj
<< /Type /Page /Parent 2 0 R /Resources 4 0 R /Contents 3 0 R /MediaBox [0 0 595 842]
>>
endobj
which is a Page dictionary.
Somewhere towards the end of the file, you’re likely to find an object containing metadata, such as details of the PDF engine that built the file:11 0 obj
<< /Title (Untitled) /Producer (macOS Version 26.1 \(Build 25B78\) Quartz PDFContext)
/Creator (DelightEd) /CreationDate (D:20251126211410Z00'00') /ModDate (D:20251126211410Z00'00')
>>
endobj
Right at the end of the PDF file comes the cross reference, which starts likexref
0 12
0000000000 65535 f
0000000252 00000 n
and ends with a trailertrailer
<< /Size 12 /Root 8 0 R /Info 11 0 R /ID [
] >>
startxref
8785
%%EOF
and that EOF marker ends the PDF file.
Problems
Objects, as elements on the page, can be laid out almost randomly, something that often makes converting laid-out columns of text so infuriating. PDF can just drop in blocks of text and images in whatever order they come, which often doesn’t coincide with the original flow in the text. As a PDF file proceeds one page at a time, multiple columns laid out over several pages can be particularly disastrous to extract as text, or to reconstitute in any other way.
PDF files are extremely verbose, and their contents are now largely unreadable due to the extensive use of binary streams of data, and all their supporting information. A document containing a single character may thus result in a PDF file of 160 lines, making even expansive XML files look concise in comparison. The example file used in yesterday’s article takes 9 KB in storage, for a total of only 11 PDF objects.
When a PDF file is changed by annotation, the contents of each annotation are added to the file as further objects. To save apps from having to rewrite the whole PDF file every time a change is made, changes can be appended to the end of the main file contents. Those can then be incorporated into the body by rewriting the file in ‘flattened’ form.
It’s also important to remember how old the roots of PDF are. The first volume of the Unicode standard 1.0 wasn’t published until 1991, and its introduction into Mac OS was long delayed after that. Consequently, PDF remains based on 8-bit extended ASCII text, with the main characters in a PDF file still being original 7-bit ASCII. Handling characters is generally accomplished by specifying individual characters in a specific font. This is why font substitution in PDF documents so commonly results in incorrect characters being displayed, with characters outside the extended ASCII set being most vulnerable. In worst cases, this mojibake can render entire documents incomprehensible.
