Clustering Algorithms in Python

I'm working through the book "Mining of Massive Datasets" (MMDS) to catch up with some of the latest advances in data mining, and when I hit chapter 7 on clustering algorithms, I couldn't help noticing how naturally these algorithms can be implemented in Python. Of course, you're supposed to do this sort of thing using scikit-learn, or at least numpy, but if you're familiar with this blog, you probably know I'm going to go through it from first principles because hey, where's the fun otherwise?

At a high level, the concept behind clustering is finding sets of similar items by first defining a measure of "closeness" (a distance) between two items and then identifying the ones that have the least distance between them. If you're dealing with data that can be meaningfully converted to numbers, then representing each data point by its numeric vector allows you to define distance geometrically: distances are given by the standard Euclidean distance formula. I'll skip over the whole thorny issue of converting data into numbers here and jump into what to do with them once you have them.

Fundamentally, clustering algorithms work by identifying the closest two points and gathering them together into one cluster, which is then represented by their midpoint. Once this is done, the two points "disappear" from the algorithm and a new point — the midpoint — enters into the algorithm. Now, the next two closest points (one of which may be this new midpoint) are gathered together into a cluster. After a few such iterations, the data points that are closest together will belong to the same cluster. Since the points that represent the center of a cluster are "midpoints" of several points, they're called centroids — the average of all of the points they represent.
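For example, if the two closest points in a data set are (10.0, 5.0) and (11.0, 4.0), they're replaced by the single point ((10.0 + 11.0) / 2, (5.0 + 4.0) / 2) = (10.5, 4.5), which then competes with all of the remaining points in the next iteration.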

To implement this algorithm in Python, I'll represent clusters as dictionaries whose keys are the centroids and whose values are lists of the individual points. To start out, I'll cluster points in two-dimensional Euclidean space, which I'll represent as tuples of two floats. I need one function to compute the distance between two points, shown in listing 1:

def distance(p1, p2):
  return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

Listing 1: distance between two points in 2D space
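As a quick sanity check (assuming math has been imported so that listing 1 runs), the distance between two of the points I'll be clustering below is:

>>> distance((10.0, 5.0), (11.0, 4.0))
1.4142135623730951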

Listing 1 is just the standard Euclidean distance formula. I'll also need a centroid function to compute the center of mass of a group of points (an extension of the standard midpoint formula), shown in listing 2.

def centroid(c_points):
  return (sum([x[0] for x in c_points]) / len(c_points), sum([x[1] for x in c_points]) / len(c_points))

Listing 2: centroid function
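And a quick check of the centroid function on two more of those points:

>>> centroid([(4.0, 10.0), (4.0, 8.0)])
(4.0, 9.0)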

Given these two functions, the clustering algorithm consists of finding the two points closest together and combining them into a new cluster represented by the centroid. The only place this gets tricky is that once two points have been combined into a cluster, the distance from any other point to that cluster is measured to the cluster's centroid. Therefore, since the keys of my clusters dictionary are the centroids, I'll start by assigning each point to its own cluster, find the two clusters that are closest together, and then combine them into a new cluster with a new centroid. One iteration of this algorithm is illustrated in listing 3.

    min_dist = sys.maxint
    points = clusters.keys()
    for p1 in range(len(points)):
      for p2 in range(len(points)):
        dist = distance(points[p1], points[p2])
        if points[p1] != points[p2] and dist < min_dist:
          min_dist = dist
          min_p1 = points[p1]
          min_p2 = points[p2]

    new_cluster = clusters[min_p1] + clusters[min_p2]
    center = centroid(new_cluster)
    del(clusters[min_p1])
    del(clusters[min_p2])
    clusters[center] = new_cluster

Listing 3: Identify the next cluster

This algorithm starts by cycling through every pair of points to find the pair with the minimum distance (as defined by the distance function in listing 1). The candidate points are the keys of the "clusters" dictionary, which is initialized from the input points. Once the closest two points are found, a new cluster is formed by concatenating the two old clusters' point lists (each of which, again, starts out as a "cluster" of a single point), and the old clusters are removed. The key of the new cluster is the centroid of all of the points in the cluster.

Example 1 illustrates the initial configuration and the first six iterations of this algorithm; I'm using the data points from example 7.2 of MMDS.

{(4.0, 10.0): [(4.0, 10.0)], 
 (10.0, 5.0): [(10.0, 5.0)], 
 (11.0, 4.0): [(11.0, 4.0)], 
 (6.0, 8.0): [(6.0, 8.0)], 
 (4.0, 8.0): [(4.0, 8.0)], 
 (5.0, 2.0): [(5.0, 2.0)], 
 (9.0, 3.0): [(9.0, 3.0)], 
 (12.0, 6.0): [(12.0, 6.0)], 
 (7.0, 10.0): [(7.0, 10.0)], 
 (2.0, 2.0): [(2.0, 2.0)], 
 (3.0, 4.0): [(3.0, 4.0)], 
 (12.0, 3.0): [(12.0, 3.0)]}

{(4.0, 10.0): [(4.0, 10.0)], 
 (10.5, 4.5): [(10.0, 5.0), (11.0, 4.0)], 
 (6.0, 8.0): [(6.0, 8.0)], 
 (4.0, 8.0): [(4.0, 8.0)], 
 (5.0, 2.0): [(5.0, 2.0)], 
 (9.0, 3.0): [(9.0, 3.0)], 
 (12.0, 6.0): [(12.0, 6.0)], 
 (7.0, 10.0): [(7.0, 10.0)], 
 (2.0, 2.0): [(2.0, 2.0)], 
 (3.0, 4.0): [(3.0, 4.0)], 
 (12.0, 3.0): [(12.0, 3.0)]}

{(10.5, 4.5): [(10.0, 5.0), (11.0, 4.0)], 
 (4.0, 9.0): [(4.0, 10.0), (4.0, 8.0)], 
 (6.0, 8.0): [(6.0, 8.0)], 
 (5.0, 2.0): [(5.0, 2.0)], 
 (9.0, 3.0): [(9.0, 3.0)], 
 (12.0, 6.0): [(12.0, 6.0)], 
 (7.0, 10.0): [(7.0, 10.0)], 
 (2.0, 2.0): [(2.0, 2.0)], 
 (3.0, 4.0): [(3.0, 4.0)], 
 (12.0, 3.0): [(12.0, 3.0)]}

{(4.0, 9.0): [(4.0, 10.0), (4.0, 8.0)], 
 (6.0, 8.0): [(6.0, 8.0)], 
 (5.0, 2.0): [(5.0, 2.0)], 
 (12.0, 6.0): [(12.0, 6.0)], 
 (7.0, 10.0): [(7.0, 10.0)], 
 (2.0, 2.0): [(2.0, 2.0)], 
 (3.0, 4.0): [(3.0, 4.0)], 
 (12.0, 3.0): [(12.0, 3.0)], 
 (10.0, 4.0): [(10.0, 5.0), (11.0, 4.0), (9.0, 3.0)]}

{(5.0, 2.0): [(5.0, 2.0)], 
 (12.0, 6.0): [(12.0, 6.0)], 
 (4.666666666666667, 8.666666666666666): [(4.0, 10.0), (4.0, 8.0), (6.0, 8.0)], 
 (7.0, 10.0): [(7.0, 10.0)], 
 (2.0, 2.0): [(2.0, 2.0)], 
 (3.0, 4.0): [(3.0, 4.0)], 
 (12.0, 3.0): [(12.0, 3.0)], 
 (10.0, 4.0): [(10.0, 5.0), (11.0, 4.0), (9.0, 3.0)]}

{(5.0, 2.0): [(5.0, 2.0)], 
 (12.0, 6.0): [(12.0, 6.0)], 
 (4.666666666666667, 8.666666666666666): [(4.0, 10.0), (4.0, 8.0), (6.0, 8.0)], 
 (7.0, 10.0): [(7.0, 10.0)], 
 (2.5, 3.0): [(2.0, 2.0), (3.0, 4.0)], 
 (12.0, 3.0): [(12.0, 3.0)], 
 (10.0, 4.0): [(10.0, 5.0), (11.0, 4.0), (9.0, 3.0)]}

{(10.5, 3.75): [(12.0, 3.0), (10.0, 5.0), (11.0, 4.0), (9.0, 3.0)], 
 (5.0, 2.0): [(5.0, 2.0)], 
 (12.0, 6.0): [(12.0, 6.0)], 
 (4.666666666666667, 8.666666666666666): [(4.0, 10.0), (4.0, 8.0), (6.0, 8.0)], 
 (7.0, 10.0): [(7.0, 10.0)], 
 (2.5, 3.0): [(2.0, 2.0), (3.0, 4.0)]}

Example 1: Cluster identification

At each iteration, the cluster created from the previous step's two closest points shows up as a new entry whose value lists more than one point. Note again that "close" here means the distance between the centroids of the clusters.

Finally, how many iterations should be performed? One option is to keep going until there's only one cluster left; another is to stop after a predetermined number of iterations. A more computationally expensive but sensible middle-of-the-road approach is to stop when the points in a cluster get "too far" away from one another. To make this precise enough to implement, MMDS introduces the concept of the diameter of a cluster: the distance between the two farthest points in the cluster. You can compute this in a single list comprehension, as shown in listing 4.

def diameter(cluster):
  return max([distance(p1, p2) for p1 in cluster for p2 in cluster])

Listing 4: Diameter of a cluster
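As a quick check, the diameter of the three-point cluster that appears in example 1, [(10.0, 5.0), (11.0, 4.0), (9.0, 3.0)], is the distance between (9.0, 3.0) and (11.0, 4.0): sqrt(4 + 1), or about 2.236.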

The algorithm stops when the average diameter of all of the clusters increases too much in a single step. Listing 5 computes the average diameter of all of the clusters, again using a single list comprehension.

def average_diameter(clusters):
  return sum([diameter(c) for c in clusters.values()]) / len(clusters)

Listing 5: Average diameter of all clusters
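For instance, in the final configuration of example 1, the three singleton clusters each contribute a diameter of 0, and the remaining three clusters have diameters of 3.0, about 2.83, and about 2.24, so the average diameter is roughly 8.07 / 6 ≈ 1.34.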

I had to play around with this a little bit to find something that generated a reasonable number of clusters: I found that stopping when the ratio of the new average diameter to the old one, divided by the number of clusters, was greater than 0.5 produced a sensible result given my small sample data.

    min_dist = sys.maxint
    points = clusters.keys()
    for p1 in range(len(points)):
      for p2 in range(len(points)):
        dist = distance(points[p1], points[p2])
        if points[p1] != points[p2] and dist < min_dist:
          min_dist = dist
          min_p1 = points[p1]
          min_p2 = points[p2]

    new_cluster = clusters[min_p1] + clusters[min_p2]
    center = centroid(new_cluster)
    old_clusters = dict(clusters) # remember old config to back out if average goes up too high
    del(clusters[min_p1])
    del(clusters[min_p2])
    clusters[center] = new_cluster

    new_avg_diameter = average_diameter(clusters)
    if avg_diameter > 0:
      if (new_avg_diameter / avg_diameter) / len(clusters) > 0.5:
        clusters = old_clusters
        break

    avg_diameter = new_avg_diameter

Listing 6: Stop when "enough" clusters have been identified

All that's left is to initialize the clustering algorithm with the points. They're provided as a list of tuples and converted by the clustering function into a dictionary whose keys are the clusters' centroids and whose values are lists of points; initially, each value is just a one-element list containing its own key. (This is fine for testing, but there may be some issues with using floats as dictionary keys. I could mitigate this, at the expense of readability, by converting every number to some fixed-point integer representation; I'll sketch one way to do that after listing 7.)

def cluster(in_points):
  clusters = dict((p,[p]) for p in in_points)

Listing 7: Initializing the clustering function
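As mentioned above, one way to sidestep the float-keys concern would be to quantize each coordinate into a fixed-point integer before using it as a dictionary key. This is just a sketch of the idea (to_key is my own name, not something from MMDS), and the listings below don't actually use it:

def to_key(point, precision=6):
  # Convert each float coordinate to a fixed-point integer so that centroids that
  # differ only by floating-point rounding noise map to the same dictionary key
  return tuple(int(round(x * 10 ** precision)) for x in point)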

Extending this just a bit by passing the distance and centroid functions as parameters to the algorithm makes it easy to experiment with different distance and centroid calculations (including non-Euclidean measures), as suggested in the exercises for section 7.2 of MMDS. The whole routine is shown in listing 8.

import sys
import math

# This is the distance between just two points
def distance_2d(p1, p2):
  return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

# This is the "center" of an arbitrary number of points
def centroid_2d(c_points):
  return (sum([x[0] for x in c_points]) / len(c_points), sum([x[1] for x in c_points]) / len(c_points))

# determine the "diameter" of a cluster of points (a list of tuples/pairs of floats): the maximum
# distance between any two points
def diameter(cluster):
  return max([distance_2d(p1,p2) for p1 in cluster for p2 in cluster])

# Cycle through the clusters and determine the average diameter of each
def average_diameter(clusters):
  return sum([diameter(c) for c in clusters.values()]) / len(clusters)

# in_points is a list of tuples (i.e. pairs) of points.
def cluster(in_points, distance, centroid):
  clusters = dict((p,[p]) for p in in_points)

  avg_diameter = 0

  while len(clusters) > 1:
    min_dist = sys.maxint
    points = clusters.keys()

    for p1 in range(len(points)):
      for p2 in range(len(points)):
        dist = distance(points[p1], points[p2])
        # Break ties arbitrarily: <= picks last; < picks first
        if points[p1] != points[p2] and dist < min_dist:
          min_dist = dist
          min_p1 = points[p1]
          min_p2 = points[p2]

    new_cluster = clusters[min_p1] + clusters[min_p2]
    center = centroid(new_cluster)
    old_clusters = dict(clusters) # remember old config to back out if the average goes up too high
    del(clusters[min_p1])
    del(clusters[min_p2])
    clusters[center] = new_cluster

    new_avg_diameter = average_diameter(clusters)
    if avg_diameter > 0:
      if (new_avg_diameter / avg_diameter) / len(clusters) > 0.5:
        clusters = old_clusters
        break

    avg_diameter = new_avg_diameter

  return clusters

print cluster([(11.0,4.0), (12.0,3.0), (4.0,10.0), (7.0,10.0), (4.0,8.0), (6.0,8.0), (12.0,6.0), 
  (10.0,5.0), (3.0,4.0), (9.0,3.0), (2.0,2.0), (5.0,2.0)], distance_2d, centroid_2d)

Listing 8: Complete 2-D clustering algorithm
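As one example of the kind of experiment section 7.2 suggests, a Manhattan (L1) distance could be dropped in for the Euclidean one; this is just a sketch, and distance_manhattan is my own name:

# Manhattan (L1) distance between two 2-D points: the sum of the absolute
# differences of their coordinates
def distance_manhattan(p1, p2):
  return abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])

Calling cluster(points, distance_manhattan, centroid_2d) then runs the same agglomeration with the new measure; note that diameter as written above still uses distance_2d, so the stopping criterion remains Euclidean.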

This approach can easily be extended to n-dimensional space by just defining longer tuples; assuming that each tuple is a vector of identical length, the distance formula becomes:

def distance(p1, p2):
  return math.sqrt(sum([(p1[i] - p2[i]) ** 2 for i in range(len(p1))]))

Listing 9: n-dimensional distance formula

And the centroid formula becomes:

def centroid(c_points):
  return tuple(sum([x[i] for x in c_points]) / len(c_points) for i in range(len(c_points[0])))

Listing 10: n-dimensional centroid formula
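As a quick check of the generalized functions with three-dimensional tuples (the numbers here are mine, not from MMDS):

>>> distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0))
5.0
>>> centroid([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)])
(1.0, 2.0, 3.0)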

Of course, each dimension slows the whole thing down a bit, although the running time is still dominated by the time spent cycling through every pair of points. In reality, it's not feasible to examine every pair of points in interesting data sets (like, say, the purchases of all Amazon customers or the viewership of all of the videos on YouTube), so a more scalable algorithm like k-means is applied instead. At a high level, it works toward the same goal as the implementation presented here, but rather than comparing every single pair of points, it picks k initial centroids and repeatedly assigns each point to its nearest centroid, recomputing the centroids as it goes, so that points that are close to one another still end up, with high probability, in the same cluster. A popular Python implementation of k-means is the one in scikit-learn.
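For reference, here's roughly what the same sample data looks like run through scikit-learn's KMeans; this is just a sketch, and unlike the algorithm above, k-means needs the number of clusters chosen up front (I'm picking 3 here):

from sklearn.cluster import KMeans

points = [(11.0, 4.0), (12.0, 3.0), (4.0, 10.0), (7.0, 10.0), (4.0, 8.0), (6.0, 8.0),
          (12.0, 6.0), (10.0, 5.0), (3.0, 4.0), (9.0, 3.0), (2.0, 2.0), (5.0, 2.0)]

# Fit k-means with k = 3; labels_ maps each input point to a cluster number and
# cluster_centers_ holds the centroids the algorithm converged to
km = KMeans(n_clusters=3).fit(points)
print(km.labels_)
print(km.cluster_centers_)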
