Inside the PDF File Format

PDF files are all over the internet — publishers use them almost exclusively, and if you try to download any academic papers, the links usually come with a "PDF warning", just in case you don't feel like downloading a few megabytes of document and potentially opening up a separate window just to read the content. A lot of applications don't even have a "print" option; they just export a PDF view which you can then print from Acrobat. So what are these PDFs? Why PDF rather than HTML?

The truth is that PDF, or Portable Document Format, gets sort of a bad rap from users who inevitably compare it to HTML, but this isn't entirely fair, since PDF is optimized as a format for printing and concise document specification. By design, an HTML document is supposed to render in whatever format looks best for the user agent; PDF, on the other hand, is supposed to look exactly the same whether it's viewed on screen, on paper, on a mobile device, etc. How faithfully it does so is, of course, subject to the limitations of the target device (printers have a much higher resolution than any computer screen), but Adobe puts a lot of effort into preserving fidelity across targets. PDF has been around since the early 90's, having evolved from an earlier format called PostScript. Both were conceived and controlled by Adobe, a company that was founded by two of the engineers from Xerox who worked on the original desktop computer design.

PostScript is actually a fully-featured programming language. You can define procedures, conditional operators, variables, etc. PostScript is "Turing complete". However, PostScript is a programming language meant for printers to interpret, and PostScript "programs" ordinarily describe what a page or set of pages should look like. The PostScript commands are transmitted, in source code form, to the printer, which interprets/compiles the commands, updates the global state, and executes the commands which generally involve making physical marks on paper.

If you have access to a laser printer, it probably supports PostScript directly (I've had good luck with HP support for PostScript). Figure 1 is a complete PostScript program; you can send this directly as text, without any preprocessing, to a PostScript capable printer. For example, if you save figure 1 as "hello.ps" and your printer is at IP address 192.168.1.2, you could do this:

telnet 192.168.1.2 9100 < hello.ps
and the output should look like Figure 2.

/Times-Roman findfont
12 scalefont
setfont
newpath
50 700 moveto
(Hello, World) show

Figure 1: PostScript file

Figure 2: Printed PostScript file

The point being that PostScript is a text format for printers to interpret directly. Now, programming in PostScript is sort of like programming in assembler: you have infinite flexibility, but infinite tedium as well. PostScript doesn't even figure out where the line breaks should go on the paper; you're responsible for determining when you've reached the end of the line/page and moving to a new one. (The technical term for this process is typesetting.) Even the most die-hard of command-line fanatics don't program directly in PostScript but instead use a preprocessor like troff to deal with the typesetting. Troff input looks like Figure 3 and can be typeset and fed to a networked (PostScript capable) printer via a command like:

troff -T ps < source.tr | telnet 192.168.1.2 9100

.ll 6i
.ps 12
.vs 16
The area is \(*p\fIr\fR\|\s8\u2\d\s0

Figure 3: Troff source

With Troff, document authors could take advantage of some features that HTML authors or MS-Word users of today take for granted such as automatic computation of line breaks or justified alignment. Still, you can't say that Figure 3 is particularly readable — before proofreding [*] it, you'd need to convert it to PostScript and print it out(!). To save a few trees, on-screen PostScript readers like GhostScript were created.

Still, as a shared document format, PostScript had some problems. In order to view a PostScript file onscreen, the entire embedded program had to be interpreted and run. There was no possibility of random access, since the program itself maintained a global state; in order to show the user page 700, for example, the viewer program had to parse and interpret the first 699 pages so that the application would be in the correct state. In the early 1990's, Adobe started work on what they called the Portable Document Format which aimed to unify printer-friendly and screen-friendly formatting. Like PostScript, PDF is a text format which describes what a printer ought to do in order to display it; however, general programming constructs like loops and variables were removed to make random-access feasible.

Figure 4 is pretty much the smallest parseable PDF file you could put together.

%PDF-1.6
1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
>>
endobj
2 0 obj
<<
  /Type /Pages
  /Count 1
  /Kids [3 0 R]
>>
endobj
3 0 obj
<<
  /Type /Page
  /Parent 2 0 R
  /MediaBox [0 0 614 794]
  /Contents 4 0 R
  /Resources 5 0 R
>>
endobj
4 0 obj
<<
  /Length 52
>>
stream
BT
/F0 1 Tf
12 0 0 12 10 750 Tm
(Hello, World) Tj
ET
endstream
endobj
5 0 obj
<<
  /ProcSet [/PDF]
  /Font <<
    /F0 6 0 R
  >>
>>
endobj
6 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Helvetica
>>
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000062 00000 n
0000000125 00000 n
0000000239 00000 n
0000000343 00000 n
0000000412 00000 n
trailer
<</Root 1 0 R
/ID [<01234567890ABCDEF> <01234567890ABCDEF>]
/Size 7
>>
startxref
488
%%EOF

Figure 4: Hello, World PDF

If you download this file and open it in Acrobat, you'll see a simple output similar to the one in figure 5; however, if you open the same file in a text editor like Notepad or vi, you'll see that it's identical to figure 4.

Download it, don't copy-paste it, because line-ending conventions matter here — I'll get to why below.

Figure 5: Hello, World PDF rendered

Figure 4 might seem a little opaque at first, but if you start to look at it, you can begin to see some regularity here. First, you see that there are regular delimiters obj and endobj. There are 6 of these, and each is given a sequential number. The obj entries are followed by an xref entry, a trailer entry and a startxref entry. PDFs are actually designed to be read "backwards" starting at the end. The very last entry before the %%EOF delimiter is the startxref entry:

startxref
488
This is a pointer to the cross-reference table. In this example, the xref table starts at byte 488 of the file.

This is why line-ending conventions matter; on a Windows machine, a text editor would save CRLF pairs for line endings, which would change the locations of the object entries within the file.
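To make the offset dependence concrete, here is a minimal Python sketch. It assumes the file is already in memory as bytes; `find_xref_offset` and the tiny `sample` document are made-up names for illustration, and `sample` is only a stand-in, not a usable PDF.

```python
def find_xref_offset(pdf_bytes: bytes) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    marker = pdf_bytes.rfind(b"startxref")
    if marker == -1:
        raise ValueError("no startxref keyword found")
    # The offset is the first whitespace-separated token after the keyword.
    tail = pdf_bytes[marker + len(b"startxref"):].split()
    return int(tail[0])


# Stand-in document: a nine-byte header line, then an empty xref section.
sample = b"%PDF-1.6\nxref\n0 0\ntrailer\n<<>>\nstartxref\n9\n%%EOF\n"
offset = find_xref_offset(sample)
print(offset, sample[offset:offset + 4])

# Re-saving the same text with CRLF line endings shifts every byte after the
# first line, so the recorded offset no longer lands on the xref keyword.
crlf = sample.replace(b"\n", b"\r\n")
shifted = find_xref_offset(crlf)
print(crlf[shifted:shifted + 4])
```

The first print shows the offset landing on the xref keyword; the second shows the same recorded offset landing one byte early once the line endings have changed.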

If you follow this backwards, you'll see that the startxref entry points to the line that reads xref. This section, which starts from the xref token and runs to the trailer section, is a list of pointers to the other objects in the file:

xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000062 00000 n
0000000125 00000 n
0000000239 00000 n
0000000343 00000 n
0000000412 00000 n
The first line declares the range of objects listed in this cross-reference section: the first number is the object number of the first entry, and the second is the number of entries that follow. In this case, objects 0 through 6 are declared here. The remaining lines identify, one per line, the location of an object in the file as a byte offset from the start of the file. The first entry is for object #0, which, you'll notice, doesn't appear anywhere in the file. PDF requires that object #0 be declared as "free"; this is the meaning of the f at the end of the line (the n on the other lines marks an object that is in use). What about the 65535 in between the 0000000000 and the f?

65535 is the generation number. PDF allows documents to be revised incrementally, with the revisions stored within the document itself rather than in an external revision control system. For this reason, every object in the file carries a generation number, which starts at 0 when the object is first created and is incremented when the object is deleted and its object number is later reused. You probably won't come across PDF files with non-zero generation numbers "in the wild" unless you deal with professional publishing software. Here, object #0 is at generation #65535 (the max), but all the others are at generation 0: brand new.

So, this cross-reference table identifies 6 objects, plus the free entry for object 0. The number of each object is given by its position in the list, so the first entry describes object 0, the second object 1, and so forth. It's worth noting also that PDF doesn't require the objects to appear in numerical order in the file, and in general, PDF document creator software outputs them in a fairly random order; hence the need for the cross-reference table at the end. For this simple example, I put them in order because I'm not dealing with too many objects. However, this is a toy example; even the one-page newsletter that my kids' elementary school sends out each week declares a few hundred objects.
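As a rough illustration of how a reader might turn a cross-reference section into something usable, here is a small Python sketch. It handles a single section like the one in this example and nothing fancier; `XrefEntry`, `parse_xref` and `SAMPLE_XREF` are names I made up for the sketch.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    number: int      # object number, implied by position in the table
    offset: int      # byte offset of the object within the file
    generation: int
    in_use: bool     # True for an 'n' entry, False for an 'f' (free) entry


def parse_xref(section: str) -> list[XrefEntry]:
    """Parse one 'xref' section: a header line followed by fixed-format entries."""
    lines = section.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("section does not start with 'xref'")
    first, count = (int(token) for token in lines[1].split())
    entries = []
    for position, line in enumerate(lines[2:2 + count]):
        offset, generation, flag = line.split()
        entries.append(
            XrefEntry(first + position, int(offset), int(generation), flag == "n")
        )
    return entries


SAMPLE_XREF = """\
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000062 00000 n
0000000125 00000 n
0000000239 00000 n
0000000343 00000 n
0000000412 00000 n
"""

entries = parse_xref(SAMPLE_XREF)
for entry in entries:
    print(entry)
```

With the entries parsed, finding object 3 is just a matter of seeking to `entries[3].offset` in the file.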

So, I've been talking a lot about "objects". What's a PDF object? You can see from figure 4 that an object is delimited by obj/endobj tags. In this example, each of the objects is additionally delimited by << >> pairs, but this isn't strictly a requirement; PDF allows all sorts of types to occur as objects, but most of the time, you'll see that they're << >>-delimited dictionary objects. PDF defines eight basic types of objects: boolean (true/false), numeric, string, name, array, dictionary, stream and null. Numerics and booleans can be recognized by their contents, but strings, names, arrays and dictionaries have special delimiters that identify them to the parser. Strings are delimited by parentheses (), names by a leading slash /, arrays by brackets [] and dictionaries by what are technically referred to as "guillemets" which is what the French use for quotation marks but what you and I (unless you're French) would probably just call "double angle brackets" <<>>. (Note that this is specifically two "less-than" signs and two "greater-than" signs, not the more correct « » characters.)
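The delimiter rules are simple enough that a first guess at an object's type only needs its opening characters. The Python sketch below is just that, a first guess: `classify` is a made-up helper, it looks only at the start of an already-isolated token, and it also treats the single-angle-bracket hexadecimal strings used in the /ID entry as strings.

```python
import re

NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def classify(token: str) -> str:
    """Guess the PDF object type of a token from its leading delimiter."""
    token = token.strip()
    if token in ("true", "false"):
        return "boolean"
    if token.startswith("<<"):
        return "dictionary"
    if token.startswith("(") or token.startswith("<"):
        return "string"      # literal (...) or hexadecimal <...> form
    if token.startswith("/"):
        return "name"
    if token.startswith("["):
        return "array"
    if NUMBER.fullmatch(token):
        return "numeric"
    return "unknown"


for sample in ["true", "794", "(Hello, World)", "/Type", "[3 0 R]", "<< /Type /Font >>"]:
    print(sample, "->", classify(sample))
```

Note that the dictionary check has to come before the string check, since both start with a less-than sign.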

In this example, all of the top-level objects are dictionaries which are name-value pairs where the names are name types and the values are any other type of object, including another dictionary. Also notice that, at the very end, after the xref section, there's a trailer which contains a dictionary object. The most important element of this dictionary is the Root entry which is a pointer to (surprise) the "root" of the document.

As you can observe from figure 4, top-level objects are numbered; each one must be given a unique number, and each should appear as an entry in the cross-reference table. Once this is done, pointers (or references) can be used in place of actual objects anywhere in the file. You see this throughout the file — references are specified as object_number generation_number R. So, 1 0 R is a pointer to object 1. Wherever a reference is encountered, the PDF parser replaces the reference with the object being referenced. In fact, you can think of figure 4 as being expanded as shown in figure 6:

%PDF-1.6
  trailer
  <</Root 
    <<
      /Type /Catalog
      /Pages 
      <<
        /Type /Pages
        /Count 1
        /Kids [
          <<
            /Type /Page
            /Parent XXXX
            /MediaBox [0 0 614 794]
            /Contents
              <<
                /Length 52
              >>
              stream
              BT
              /F0 1 Tf
              12 0 0 12 10 750 Tm
              (Hello, World) Tj
              ET
              endstream
            /Resources
              <<
                /ProcSet [/PDF]
                /Font <<
                  /F0
                    <<
                      /Type /Font
                      /Subtype /Type1
                      /BaseFont /Helvetica
                    >>
              >>
            >>
        >>
      ]
    >>
  >>
/ID [<01234567890ABCDEF> <01234567890ABCDEF>]
/Size 7
>>
488
%%EOF

Figure 6: expanded PDF example

This isn't a valid PDF (content streams can't be embedded inside other objects this way, and the Parent entry of the Page dictionary must be a reference back to an enclosing object, which can't be written out inline), but this demonstrates logically what the PDF parser does at display time.
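The same expansion can be sketched in a few lines of Python. Everything here is a toy model I made up for illustration: the `objects` dictionary stands in for the parsed file (the content stream is left out), `Ref` stands in for an "n g R" reference, and `expand` simply leaves a reference alone when following it would loop back on itself, which is exactly the /Parent problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    number: int
    generation: int = 0


# Hypothetical in-memory model of the example's objects (stream omitted).
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Count": 1, "Kids": [Ref(3)]},
    3: {"Type": "Page", "Parent": Ref(2), "Resources": Ref(5)},
    5: {"Font": {"F0": Ref(6)}},
    6: {"Type": "Font", "Subtype": "Type1", "BaseFont": "Helvetica"},
}


def expand(value, seen=()):
    """Replace references with their targets, leaving back-references alone."""
    if isinstance(value, Ref):
        if value.number in seen:
            return value  # following it would loop forever, keep the reference
        return expand(objects[value.number], seen + (value.number,))
    if isinstance(value, dict):
        return {key: expand(item, seen) for key, item in value.items()}
    if isinstance(value, list):
        return [expand(item, seen) for item in value]
    return value


tree = expand(Ref(1))
page = tree["Pages"]["Kids"][0]
print(page["Resources"]["Font"]["F0"]["BaseFont"])
print(page["Parent"])
```

A real viewer wouldn't expand the whole tree up front like this; it would follow references lazily as it needs them, which is the whole point of the cross-reference table.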

So, turning back to the trailer:

trailer
<</Root 1 0 R
/ID [<01234567890ABCDEF> <01234567890ABCDEF>]
/Size 7
>>

The most important entry in the trailer dictionary is the /Root declaration — this is a reference to the Catalog object. The Catalog object:

1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
>>
endobj
in turn points to the pages object:
2 0 obj
<<
  /Type /Pages
  /Count 1
  /Kids [3 0 R]
>>
endobj
which describes, at a high level, the structure of the document. In particular, this document has one page (the /Count entry) whose description can be found in /Kids. Notice also that the /Kids entry is a []-delimited array, not just a bare reference like the others. The single page is described in object 3:
3 0 obj
<<
  /Type /Page
  /Parent 2 0 R
  /MediaBox [0 0 614 794]
  /Contents 4 0 R
  /Resources 5 0 R
>>
endobj
Here, finally, we're starting to get to the meat of the document. First of all, the /MediaBox entry describes the actual size of the page in 1/72s of an inch — 614x794 ≈ 8.5"x11", the standard U.S. page size.
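The unit conversion is easy to check. A quick Python sketch, assuming the default of 72 units per inch described above (`media_box_inches` is just a name I picked):

```python
POINTS_PER_INCH = 72


def media_box_inches(box):
    """Convert a [llx lly urx ury] MediaBox into (width, height) in inches."""
    llx, lly, urx, ury = box
    return ((urx - llx) / POINTS_PER_INCH, (ury - lly) / POINTS_PER_INCH)


width, height = media_box_inches([0, 0, 614, 794])
print(f"{width:.2f} x {height:.2f} inches")

# Going the other way: an exact 8.5 x 11 inch page, expressed in points.
print(8.5 * POINTS_PER_INCH, 11 * POINTS_PER_INCH)
```

So the box in this example is a hair larger than letter size, which works out to exactly 612 by 792.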

Additionally, there's a reference to a /Resources entry:

5 0 obj
<<
  /ProcSet [/PDF]
  /Font <<
    /F0 6 0 R
  >>
>>
endobj
The most important part of this is the /Font declaration, which is a dictionary mapping font names (like /F0) to fonts. In this case, the document has only one font, so there's a single reference to a font object:
6 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Helvetica
>>
endobj
This is the smallest font description you can legally put together in a PDF document; since PDF is specifically a printing language, you can imagine that it has a lot of support for font descriptions. In fact, you can completely specify the geometry of a font within a PDF document so that the printer can reproduce the font exactly as it was originally described. I'll leave that to PDF software and fontophiles, though, and turn finally to the actual content of the single page of this document:
4 0 obj
<<
  /Length 52
>>
stream
BT
/F0 1 Tf
12 0 0 12 10 750 Tm
(Hello, World) Tj
ET
endstream
endobj
Here, the object starts out as a dictionary, but is followed by a stream declaration. This stream contains a series of commands that the printer should execute to display the page. This format is somewhat reminiscent of PostScript, but deliberately scaled back so that the page is a standalone element. This is executed as:
BT                          <-- Begin text
/F0 1 Tf                    <-- Select Font 0; named in the resources entry
12 0 0 12 10 750 Tm         <-- Set the text translation matrix
(Hello, World) Tj           <-- output the string "Hello, World"
ET                          <-- End text

Besides the text translation matrix on the third line, you should find this pretty self explanatory. One thing to note is that each line ends with a "command" which is preceded by its arguments. So what is the Tm command all about? Well, in its most primal form, a document is just a collection of polygons — arbitrary shapes made up of straight and curved lines, optionally connected to one another and filled in. Letters on the page — glyphs, to be technical — are just shapes as far as the printer is concerned. Fairly intricate ones, no doubt, but still just shapes. So, from the printer's perspective, the whole document can be specified as a series of points on a Cartesian coordinate system which need to be connected together and filled in. This whole coordinate system is subject to a transformation at any time; this transformation is compactly specified as a matrix which will be applied to each point. I won't go into the vagaries of matrix operations and affine transformations here (I talked a bit about it last month), but the net result of the matrix specified in this example is to scale (enlarge) the shapes to 12 points and move them to position (10,750), measured from the lower-left corner of the page.
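To see what the six numbers do, here is a small Python sketch that applies them to a couple of points. It assumes the usual reading of the operands a b c d e f, where a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f); `apply_matrix` is a made-up helper name.

```python
def apply_matrix(matrix, point):
    """Apply a six-number PDF-style matrix (a b c d e f) to an (x, y) point."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


text_matrix = (12, 0, 0, 12, 10, 750)

# The text origin lands at the translation part of the matrix.
print(apply_matrix(text_matrix, (0, 0)))

# One unit in text space becomes 12 units on the page in each direction.
print(apply_matrix(text_matrix, (1, 1)))
```

The a and d entries do the scaling and the e and f entries do the moving; the two zeros in the middle are where rotation and skew would go if there were any.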

A full introduction to the PDF formatting language would take a book; PDF also has commands for drawing arbitrary lines in any color, specifying user interaction, downloading additional content from the internet, etc. However, all PDF documents follow this same format — page objects specify commands to be executed by the printer which should translate to a printed page.

Figure 4 is longer than it strictly needs to be; I've added a lot of formatting for readability. Since PDF isn't actually designed to be human-readable, documents are usually compressed by the removal of superfluous whitespace as illustrated in Figure 7.

%PDF-1.6
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj
2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj
3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 614 794]/Contents 4 0 R/Resources 5 0 R>>endobj
4 0 obj<</Length 52>>stream
BT /F0 1 Tf 12 0 0 12 10 750 Tm (Hello, World) Tj ET
endstream
endobj
5 0 obj<</ProcSet[/PDF]/Font<</F0 6 0 R>>>>endobj
6 0 obj<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>endobj
xref
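0 7
0000000000 65535 f
0000000009 00000 n
0000000052 00000 n
0000000101 00000 n
0000000195 00000 n
0000000293 00000 n
0000000343 00000 n
trailer
<</Root 1 0 R/ID[<01234567890ABCDEF><01234567890ABCDEF>]/Size 7>>
startxref
404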
%%EOF

Figure 7: Compacted Hello, World PDF
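Counting byte offsets by hand gets old fast, so here is a hedged Python sketch of how a generator might do the bookkeeping for a compact file like this one. `build_pdf` is a made-up function, it assumes LF line endings and generation 0 throughout, and it leaves out the /ID entry to keep things short.

```python
def build_pdf(bodies: list[bytes], root: int = 1) -> bytes:
    """Assemble objects numbered 1..n plus a matching xref table and trailer."""
    out = bytearray(b"%PDF-1.6\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))          # where this object starts
        out += b"%d 0 obj" % number + body + b"endobj\n"
    xref_at = len(out)
    size = len(bodies) + 1                # one extra for the free object 0
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"       # entries are padded to 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<</Root %d 0 R/Size %d>>\n" % (root, size)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)


stream = b"BT /F0 1 Tf 12 0 0 12 10 750 Tm (Hello, World) Tj ET"
bodies = [
    b"<</Type/Catalog/Pages 2 0 R>>",
    b"<</Type/Pages/Count 1/Kids[3 0 R]>>",
    b"<</Type/Page/Parent 2 0 R/MediaBox[0 0 614 794]/Contents 4 0 R/Resources 5 0 R>>",
    b"<</Length %d>>stream\n" % len(stream) + stream + b"\nendstream\n",
    b"<</ProcSet[/PDF]/Font<</F0 6 0 R>>>>",
    b"<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>",
]
pdf = build_pdf(bodies)
print(pdf.decode("ascii"))
```

Because the offsets are recorded as the bytes are appended, they can't drift out of sync with the content the way hand-edited ones do.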

The only required whitespace is after the endobj tokens and before numbers. Notice in particular the lack of whitespace before the name tokens and their value tokens as in:

/Type/Font/Subtype/Type1/BaseFont/Helvetica
This is three individual dictionary entries, all run together on a single line with no intervening whitespace. Since "/" is not a valid name token character, the parser will know upon encountering it that the name part of the token is complete and the value part begins.
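Here is a minimal Python sketch of that splitting rule. It is not a full PDF tokenizer; `NAME` and `split_names` are names I made up, and the pattern simply stops a name at whitespace or at any of the delimiter characters.

```python
import re

# A name is a slash followed by any run of non-whitespace, non-delimiter bytes.
NAME = re.compile(rb"/[^\s/\[\]<>(){}%]*")


def split_names(data: bytes) -> list[bytes]:
    """Return the name tokens found in data, in order."""
    return NAME.findall(data)


tokens = split_names(b"/Type/Font/Subtype/Type1/BaseFont/Helvetica")
pairs = list(zip(tokens[0::2], tokens[1::2]))
for key, value in pairs:
    print(key.decode(), "=", value.decode())
```

Pairing the tokens up two at a time only works here because every value happens to be a name; a real parser would read each value according to its own type.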

Since content streams are usually pretty long, PDF additionally allows (and virtually all applications take advantage of) the content stream for each page to be compressed within the document itself. PDF supports both Flate compression and LZW compression of content streams. Of course, images can also be embedded and can be compressed as well, including as JPEG streams. The tiny content stream in figure 4 is hardly worth compressing, but most PDFs include hundreds or thousands of typesetting commands, which are repetitive and lend themselves well to Lempel-Ziv style compression. In reality, the object declarations are pretty repetitive as well; however, if those were compressed, the whole document would need to undergo a decompression stage before the viewer could start rendering it, so PDFs are virtually always uncompressed except for their content streams and embedded graphics.
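A quick way to get a feel for how well typesetting commands compress is to run some through Python's zlib module, which implements the same deflate algorithm that Flate compression refers to. This is only a sketch: the 200 identical lines are an artificial stand-in for a real page, and a real PDF would also have to declare the filter (/Filter /FlateDecode) in the stream dictionary.

```python
import zlib

# An artificially repetitive "page": the same text-drawing line 200 times.
line = b"BT /F0 1 Tf 12 0 0 12 10 750 Tm (Hello, World) Tj ET\n"
content = line * 200

compressed = zlib.compress(content)
print(len(content), "bytes before,", len(compressed), "bytes after")

# Decompressing gives back exactly the original stream.
restored = zlib.decompress(compressed)
print(restored == content)
```

Real pages aren't this repetitive, so don't expect ratios like this in practice, but the same operators and font names do come up over and over.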

If you open pretty much any other PDF file in a text editor, you'll notice that the top two lines probably look like this:

%PDF-1.6
%äãÏÒ
The meaning of the "%PDF-1.6" part is obvious enough; this tells the opening application that this is a PDF file conformant to version 1.6 of the specification, but what about the garble that follows it? This is a comment, so it's ignored by the PDF reader, but it serves as a warning that this document contains non-ASCII characters (which will invariably be the case if the document compresses its page content streams, which documents invariably do).

Putting the cross-reference table at the end of the document simplifies the job for the document creator, but it creates a poorer user experience for the consumer of the document, since the whole document must be scanned before the rendering software can do anything with it. If a PDF is created once and printed in total on paper, this makes sense: since the only human interaction with the file is that of the author generating it, the process ought to be optimized for the author. However, modern PDF usage has a PDF file being viewed far more often onscreen than in printed form, so Adobe came up with the "linearized" form to streamline the generation of a viewable PDF. One of the main differences between linearized and non-linearized is that the cross-reference table comes at the front. This creates more work for the document authoring application, since the first byte of the file can't be output until the offsets of each object are known. However, this means that the user can jump to a page from the table of contents as soon as the first few kilobytes of the file have been processed; for a document containing many hundreds of pages, this can be a significant advantage.

*: Yes, that was a joke

kiai, 2015-05-05
Thanks for this article. Very helpful.
xonyon, 2016-03-17
Thanks for this qualified piece of work! Splendid!
bob, 2016-07-07
the pdf example at the top has an interesting bug. It will open with adobe reader but when you go to close the reader it asks if you want to save. This is with Adobe Acrobat Reader DC.

I have the same issue with self generated pdfs and haven't found the reason (yet).

Said from that this is a nice writeup.
Josh, 2016-07-12
Ha - I never noticed that. I think this may be new behavior in Acrobat, actually - it looks like what it does is canonicalize the PDF by compressing the parts that are normally compressed. There's definitely a conversion that takes place.

Joshua Davies