Reflections on Three Years of Reading Knuth
A little over three years ago, I picked up a boxed set of the first three volumes of Donald Knuth's "The Art of Computer Programming" (TAOCP). I knew them by reputation before I ever considered reading them — a reputation for dense mathematics along with a lot of brain-bendingly difficult problems. For whatever reason, that's the sort of odd challenge that appeals to me, so I went into the series with the intent to work every single problem. That turned out to be unrealistically ambitious - for one thing, a few problems are marked as "research problems" that Knuth himself admits should take a motivated graduate student a semester to complete satisfactorily, and a few more are listed as "unsolved research problems" that may not have a solution at all. I ended up settling for making at least a realistic attempt at every problem; even so, I was able to work far fewer of them than I expected. This is some seriously advanced material. I finally finished volume 3 last month, after three years of working as many problems as I was smart enough to work (which turned out to be fewer than I'd care to admit - but don't judge me until you try it yourself!).
I know that a lot of programmer types talk about these books — more than once I've seen them referred to as the books that everybody recommends but nobody actually reads. A lot of people are put off by the academic packaging or the cost; I can't disagree with either of those concerns. The books are heavily academic — to the point that they can and have been used as college textbooks. They're also expensive but, I think, worth the cost.
Another reason people sometimes suggest staying away from TAOCP is that the material is irrelevant in the 21st century. This is a harder concern to counter, honestly, especially now that I've finished all three books. On the one hand, I know that the Sun developers who put together the Java programming language leaned heavily on Knuth's discussion of random number generation and large-number mathematical operations in volume 2. On the other hand, though, most of us won't be undertaking any projects like the Java programming language in our careers. It's neat to read and entertaining to know, but you can probably get along your whole life just using the PRNG and mathematical libraries built into your programming environment without ever really understanding what makes them tick.
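(To make that concrete, here's roughly what I mean, in Python rather than Java. This little snippet is mine, not Knuth's: the standard library hands you a PRNG and arbitrary-precision integers for free, and you can use both productively without ever opening volume 2.)

import random

# Python's built-in PRNG (a Mersenne Twister) -- most of us only ever call it,
# without studying the theory behind it the way volume 2 does.
random.seed(42)
print([random.randint(1, 6) for _ in range(5)])

# Likewise, Python integers have arbitrary precision out of the box, so
# "large-number" arithmetic just works without a special library.
print(2 ** 521 - 1)   # 157 digits, no overflow, no extra code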
Volume 3 actually contains both the most relevant and the least relevant material in the series. The most relevant includes Knuth's discussion of sorting and searching routines. Of course, if you've studied computer science at any level, you already have a decent handle on sorting algorithms; you probably know that Quicksort is O(n log n) and Bubblesort is O(n²), and even if you can't quite remember what that terminology means, you know that Quicksort is better than Bubblesort. I remember studying sorting algorithms as an undergraduate from Sedgewick; Sedgewick presented the worst-case behavior of each of the sorting algorithms and punted on the "average" case by presenting the results but explaining that their derivation was "beyond the scope of the book". As far as I know, outside of papers published in inaccessible academic journals, the derivation of the average case of most sorting routines is beyond the scope of every book except TAOCP, where Knuth patiently builds the concepts up from scratch. He does the same with many standard CS concepts like AVL trees and hash codes as well.
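Just to make the O(n log n) versus O(n²) gap concrete, here's a quick and decidedly unrigorous experiment of my own (nothing from the book): count the comparisons a naive bubble sort and a naive quicksort make over a batch of random inputs. The sizes and trial counts are arbitrary.

import random

def bubble_comparisons(a):
    # Naive bubble sort on a copy of a, returning how many comparisons it made.
    a = list(a)
    count = 0
    for i in range(len(a) - 1, 0, -1):
        for j in range(i):
            count += 1
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return count

def quicksort_comparisons(a):
    # Naive quicksort (first element as pivot), returning its comparison count.
    if len(a) <= 1:
        return 0
    pivot, rest = a[0], a[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return len(rest) + quicksort_comparisons(left) + quicksort_comparisons(right)

random.seed(1)
trials = [[random.random() for _ in range(200)] for _ in range(50)]
print(sum(bubble_comparisons(t) for t in trials) / len(trials))     # ~20,000 (about n^2/2)
print(sum(quicksort_comparisons(t) for t in trials) / len(trials))  # ~2,100 (about 2n ln n)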
However, the least relevant material to a modern programmer appears in the middle of volume 3, spanning about 100 pages: a detailed description of how to most efficiently use external tape drives to sort data that's too large to fit into main memory. As with most things Knuth, the presentation is fascinating and the theory is amazing. In short, the idea is to sort as much data as will fit in memory at a time, distribute the sorted runs across as many tapes as you have, and then merge the runs from the tapes onto a final sorted tape. He considers the time needed to rewind the tapes, takes advantage of tape drives that can read and write both forward and backward, and even accounts for things like rotational speed and the time it takes a tape to get up to full speed. As amazing as it is that anybody was ever able to work this out to such efficiency, I can't imagine how any of these algorithms could be applied outside of their original use case, and if you actually need to sort more data than you can fit into RAM, there are almost certainly better solutions than a bank of external tape drives. As carefully as I read all the other sections, I did find myself skimming through the more difficult exercises in this section.
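For what it's worth, the kernel of the idea does survive in modern external merge sorts, with disk files standing in for tapes. Below is my own toy sketch of that basic run-then-merge scheme; to be clear, it's not one of Knuth's polyphase or read-backward algorithms, and it ignores everything that made his analysis interesting (rewind times, drive counts, and so on).

import heapq
import random
import tempfile

def external_sort(values, run_size=4):
    # Sort one memory-sized run at a time, write each run to its own "tape"
    # (a temp file), rewind it, and finish with a single k-way merge.
    tapes = []
    for i in range(0, len(values), run_size):
        run = sorted(values[i:i + run_size])
        tape = tempfile.TemporaryFile(mode="w+")
        tape.writelines(f"{x}\n" for x in run)
        tape.seek(0)                      # "rewind" before the merge pass
        tapes.append(tape)
    runs = [(float(line) for line in tape) for tape in tapes]
    return list(heapq.merge(*runs))

random.seed(0)
data = [random.random() for _ in range(20)]
assert external_sort(data) == sorted(data)
print("sorted", len(data), "values using", (len(data) + 3) // 4, "tapes")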
Still, even setting aside the irrelevance of the middle section, it's questionable how useful the rest of the content is to a modern software developer. It's almost as unlikely that any of us will be implementing our own sorting or searching routines as it is that we'll be developing the sorts of random number generators or unlimited-precision math libraries that were the focus of volume 2. As interesting as the material is, these books are a huge time commitment, and if you don't have a lot of time to spare, you can probably spend it more "efficiently" on the study of machine learning algorithms or something else "enterprisey". If you do have the time, though, whether you actually apply the lessons from these books or not, they're fascinating to work through.
The last concern I hear (or read, anyway) from people is that the books not only use assembly language for code examples, but an imaginary assembly language that Knuth made up just for these books. Well, if you're on the fence about taking the plunge into TAOCP and you're concerned about the cost, content, or relevance of these books, I can't say that I can reasonably set your mind at ease on any of those counts. However, if you're put off by the use of assembly, I can assure you that it's actually a relatively small part of the books. Further, all of the algorithms are presented first in what Knuth calls "pseudocode" but which is nearly a compilable language. In fact, it's close enough to Python that translating it almost verbatim is practically a mechanical process.
Consider listing 1, below. This is Knuth's description of one of the more complex algorithms from volume 3, the AVL tree insertion algorithm, which keeps a binary tree balanced dynamically as new nodes are added to it. As you can see, Knuth has variables, structures, and looping constructs (although he doesn't seem to have had a chance to confer with Dijkstra with regard to the use of GOTOs).
The nodes of the tree are assumed to contain KEY, LLINK, and RLINK fields as in
Algorithm 6.2.2T. We also have a new field
B(P) = balance factor of NODE(P)
the height of the right subtree minus the height of the left subtree; this field
always contains either +1, 0, or -1. A special header node also appears at the top
of the tree, in location HEAD; the value of RLINK(HEAD) is a pointer to the root
of the tree, and LLINK(HEAD) is used to keep track of the overall height of the tree.
(Knowledge of the height is not really necessary for this algorithm, but it is useful
in the concatenation procedure discussed below.) We assume that the tree is nonempty,
namely that RLINK(HEAD) != ^.
For convenience in description, the algorithm uses the notation LINK(a,P) as a synonym
for LLINK(P) if a = -1, and for RLINK(P) if a = +1.
A1. [Initialize.] Set T <- HEAD, S <- P <- RLINK(HEAD). (The pointer variable P will
move down the tree; S will point to the place where rebalancing may be necessary, and
T always points to the parent of S.)
A2. [Compare.] If K < KEY(P), go to A3; if K > KEY(P), go to A4; and if K = KEY(P), the
search terminates successfully.
A3. [Move left.] Set Q <- LLINK(P). If Q = ^, set Q <= AVAIL and LLINK(P) <- Q and go
to step A5. Otherwise if B(Q) != 0, set T <- P and S <- Q. Finally set P <- Q and return
to step A2.
A4. [Move right.] Set Q <- RLINK(P). If Q = ^, set Q <= AVAIL and RLINK(P) <- Q and go
to step A5. Otherwise if B(Q) != 0, set T <- P and S <- Q. Finally set P <- Q and return
to step A2. (The last part of this step may be combined with the last part of step A3.)
A5. [Insert.] (We have just linked a new node, NODE(Q), into the tree, and its fields
need to be initialized.) Set KEY(Q) <- K, LLINK(Q) <- RLINK(Q) <- ^, and B(Q) <- 0.
A6. [Adjust balance factors.] (Now the balance factors on nodes between S and Q need to
be changed from zero to +/-1.) If K < KEY(S) set a <- -1, otherwise set a <- +1. Then set
R <- P <- LINK(a,S), and repeatedly do the following operations zero or more times until
P = Q: If K < KEY(P) set B(P) <- -1 and P <- LLINK(P); if K > KEY(P), set B(P) <- +1 and
P <- RLINK(P). (If K = KEY(P), then P = Q and we proceed to the next step.)
A7. [Balancing act.] Several cases now arise:
i) If B(S) = 0 (the tree has grown higher), set B(S) <- a, LLINK(HEAD) <- LLINK(HEAD) + 1,
and terminate the algorithm.
ii) If B(S) = −a (the tree has gotten more balanced), set B(S) <- 0 and terminate the algorithm.
iii) If B(S) = a (the tree has gotten out of balance), go to step A8 if B(R) = a, to A9 if
B(R) = −a. (Case (iii) corresponds to the situations depicted in (1) when a = +1; S and
R point, respectively, to nodes A and B, and LINK(−a,S) points to A, etc.)
A8. [Single rotation.] Set P <- R, LINK(a,S) <- LINK(-a,R), LINK(-a,R) <- S, B(S) <- B(R) <- 0.
Go to A10.
A9. [Double rotation.] Set P <- LINK(-a,R), LINK(-a,R) <- LINK(a,P), LINK(a,P) <- R,
LINK(a,S) <- LINK(-a,P), LINK(-a,P) <- S. Now set
    (B(S),B(R)) <- (-a,0) if B(P) = a;  (0,0) if B(P) = 0;  (0,a) if B(P) = -a;
and then set B(P) <- 0.
A10. [Finishing touch.] (We have completed the rebalancing transformation, taking (1) to (2),
with P pointing to the new subtree root and T pointing to the parent of the old subtree root S.)
If S = RLINK(T) then set RLINK(T) <- P, otherwise set LLINK(T) <- P.
Converting this description into working Python code doesn't even require a deep understanding of the underlying algorithm. Knuth uses ^ to represent "NIL" (Python's None), <- to indicate assignment, and the unusual syntax B(P) to represent the B field of P, which most languages indicate with P.B. I've removed some of the exposition but kept the essentials of the description in comments in the Python translation in listing 2, below. All of the comments are Knuth's words, not mine.
# The nodes of the tree are assumed to contain KEY, LLINK, and RLINK fields.
# We also have a new field
#
# B(P) = balance factor of NODE(P)
#
# the height of the right subtree minus the height of the left subtree; this field
# always contains either +1, 0, or -1. A special header node also appears at the top
# of the tree, in location HEAD; the value of RLINK(HEAD) is a pointer to the root
# of the tree, and LLINK(HEAD) is used to keep track of the overall height of the tree.
# We assume that the tree is nonempty, namely that RLINK(HEAD) != ^.
class node:
    def __init__(self):
        self.KEY = None
        self.LLINK = None
        self.RLINK = None
        self.B = 0

HEAD = node()
HEAD.RLINK = node()
HEAD.RLINK.KEY = 'A'
# For convenience in description, the algorithm uses the notation LINK(a,P) as a synonym
# for LLINK(P) if a = -1, and for RLINK(P) if a = +1.
def LINK(a, P):
    if a == -1:
        return P.LLINK
    else:
        return P.RLINK

def setLINK(a, P, val):
    if a == -1:
        P.LLINK = val
    else:
        P.RLINK = val
def avl(K):
    # A1. [Initialize.] Set T <- HEAD, S <- P <- RLINK(HEAD).
    T = HEAD
    S = P = HEAD.RLINK
    # A2. [Compare.] If K < KEY(P), go to A3; if K > KEY(P), go to A4; and if K = KEY(P), the
    # search terminates successfully.
    while True:
        if K == P.KEY:
            return P
        elif K < P.KEY:
            # A3. [Move left.] Set Q <- LLINK(P). If Q = ^, set Q <= AVAIL and LLINK(P) <- Q and go
            # to step A5. Otherwise if B(Q) != 0, set T <- P and S <- Q. Finally set P <- Q and return
            # to step A2.
            Q = P.LLINK
            if Q is None:
                Q = node()
                P.LLINK = Q
                break
            else:
                if Q.B != 0:
                    T = P
                    S = Q
                P = Q
        # A4. [Move right.] Set Q <- RLINK(P). If Q = ^, set Q <= AVAIL and RLINK(P) <- Q and go
        # to step A5. Otherwise if B(Q) != 0, set T <- P and S <- Q. Finally set P <- Q and return
        # to step A2.
        elif K > P.KEY:
            Q = P.RLINK
            if Q is None:
                Q = node()
                P.RLINK = Q
                break
            else:
                if Q.B != 0:
                    T = P
                    S = Q
                P = Q
    # A5. [Insert.] Set KEY(Q) <- K, LLINK(Q) <- RLINK(Q) <- ^, and B(Q) <- 0.
    Q.KEY = K
    Q.LLINK = Q.RLINK = None
    Q.B = 0
    # A6. [Adjust balance factors.] If K < KEY(S) set a <- -1, otherwise set a <- +1. Then set
    # R <- P <- LINK(a,S), and repeatedly do the following operations zero or more times until
    # P = Q: If K < KEY(P) set B(P) <- -1 and P <- LLINK(P); if K > KEY(P), set B(P) <- +1 and
    # P <- RLINK(P).
    if K < S.KEY:
        a = -1
    else:
        a = 1
    R = P = LINK(a, S)
    while P != Q:
        if K < P.KEY:
            P.B = -1
            P = P.LLINK
        elif K > P.KEY:
            P.B = 1
            P = P.RLINK
    # A7. [Balancing act.] Several cases now arise:
    #
    # i) If B(S) = 0, set B(S) <- a, LLINK(HEAD) <- LLINK(HEAD) + 1,
    # and terminate the algorithm.
    #
    if S.B == 0:
        S.B = a
        # HEAD.LLINK = HEAD.LLINK + 1
        return Q
    # ii) If B(S) = -a, set B(S) <- 0 and terminate the algorithm.
    elif S.B == -a:
        S.B = 0
        return Q
    #
    # iii) If B(S) = a, go to step A8 if B(R) = a, to A9 if B(R) = -a.
    elif S.B == a:
        if R.B == a:
            # A8. [Single rotation.] Set P <- R, LINK(a,S) <- LINK(-a,R), LINK(-a,R) <- S, B(S) <- B(R) <- 0.
            # Go to A10.
            P = R
            setLINK(a, S, LINK(-a, R))
            setLINK(-a, R, S)
            S.B = R.B = 0
        elif R.B == -a:
            # A9. [Double rotation.] Set P <- LINK(-a,R), LINK(-a,R) <- LINK(a,P), LINK(a,P) <- R,
            # LINK(a,S) <- LINK(-a,P), LINK(-a,P) <- S. Now set
            #     (B(S),B(R)) <- (-a,0) if B(P) = a; (0,0) if B(P) = 0; (0,a) if B(P) = -a;
            # and then set B(P) <- 0.
            P = LINK(-a, R)
            setLINK(-a, R, LINK(a, P))
            setLINK(a, P, R)
            setLINK(a, S, LINK(-a, P))
            setLINK(-a, P, S)
            if P.B == a:
                S.B = -a
                R.B = 0
            elif P.B == 0:
                S.B = 0
                R.B = 0
            elif P.B == -a:
                S.B = 0
                R.B = a
            P.B = 0
        # A10. [Finishing touch.] If S = RLINK(T) then set RLINK(T) <- P, otherwise set LLINK(T) <- P.
        if S == T.RLINK:
            T.RLINK = P
        else:
            T.LLINK = P
    return Q
As you can see, I changed very little; I kept the same variable names (even with their rather dated single-capital-letter style). The only really tricky part was uncovering the implicit loops defined by his GOTO statements, since Python (thankfully) doesn't support GOTO, but even that was relatively straightforward (and remember that this is one of the most complex algorithms described in the series). Otherwise, I made only a handful of modifications:
- I replaced Q <= AVAIL in steps A3 & A4 with an actual dynamic allocation.
- Since the implied "convenience function" LINK(a, P) appears on both the right and left sides of assignments, I implemented it as two separate functions: one to act as a getter and another to act as a setter.
- Knuth sidestepped the creation of the first node in an empty tree, so I did too, although this would have been trivial to address.
- In step A7, I skipped the step that keeps track of the overall height of the tree, since it wouldn't work as written in Python (the header's LLINK field starts out as None rather than a number) and isn't necessary to the algorithm; it would again be trivial to track this somewhere else.
Otherwise I just typed in exactly what was described by the algorithm and got a functioning, if not very Pythonic, AVL implementation.
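To convince myself it really does function, I ran a quick smoke test against listing 2. The in-order traversal helper and the sample keys below are my own additions, not part of Knuth's description (remember that the listing seeds the tree with a single root node whose key is 'A'):

# A quick smoke test of listing 2. The traversal helper and the sample keys
# are my additions, not Knuth's.
def inorder(p):
    if p is None:
        return []
    return inorder(p.LLINK) + [p.KEY] + inorder(p.RLINK)

for key in ['C', 'B', 'D', 'E', 'F']:   # this order forces both kinds of rotation
    avl(key)

print(inorder(HEAD.RLINK))   # ['A', 'B', 'C', 'D', 'E', 'F']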
So there you have it - if you're still skeptical about tackling TAOCP and the only thing keeping you away is the use of Knuth's MIX Assembler, rest assured that you can skip all of the MIX and still understand everything that's presented. Still, you might start to change your mind about investing the time to learn MIX once you get started; it's actually a lot of fun to work through.
Comments:
How did you go about reading the volumes? More specifically, what sort of routine (daily, weekly, ...) did you have for reading through Knuth's works?
I have tried (at least twice) during my undergrad to go through Vol 1 but only ever got through some of the math exercises. I'm taking another swing at it, but this time more for self-satisfaction than academic use.
Please allow me to belabor a well-made point. Consider Facebook querying their database for all users that have dogs as pets. In a well-structured database of all user data, this query would conceivably draw on terabytes of data, more than could be stored on any typical personal computer, spread across many separate physical data stores. In your experience with this volume, would you consider this circumstance irrelevant to the tape reading and writing algorithms?
You've convinced me to give TAOCP a serious go after I've finished Concrete Mathematics, cheers.
Now, about the relevance of this material today: one benefit is probably that you very thoroughly learn the methods for constructing and analysing algorithms through those immense three years of work you put in. What's your take on this?
It is quite inspiring to read what you have accomplished. It has always been my goal to do so sometime in my life. How many years did it take you to read Vol 1-3?
Also, now that you have worked through them, do you think your mathematical or problem solving abilities have increased?
Thank you
Yeah, Knuth may not be sexy and hip, but in my book, he's the first one to turn to when you really want to know what it's all about.
Awesome blog, man :)