orngCluster

Clustering is a statistical method that obtains a set of elements and tries to clump similar elements together into “clusters”.

orngCluster supports the following kinds of clustering:

• agglomerative hierarchical: we greedily and recursively perform clustering (n-1) times and end up with a tree
• k-means: we know the number of clusters in advance
• fuzzy: we get a probability distribution of membership of an element to a given number of clusters

We can present the data to a clustering routine in two ways, either as a set of vectors in some n-dimensional space, or as a dissimilarity matrix. Each of the above algorithms works with both representations.

You can imagine the set of vectors as a set of points in space. We represent this in python as [[1,2],[2,1],[1,1],[3,4],[5,5],[6,5],[7,6]]. You can notice that there are two distinct clusters, one centered around [1,1], and the other around [6,6]. The [3,4] is somewhere in between. Let’s see how we can use orngCluster with k-means clustering:

>>> from orngCluster import *

>>> p = [[1,2],[2,1],[1,1],[3,4],[5,5],[6,5],[7,6]]

>>> c = MClustering(p,2)

MClustering is the class which performs k-means clustering on sets of vectors, c is the object, which contains all the results. The most important field of c is mapping. For all the elements, it contains the cluster to which the element has been assigned. Let’s look at it:

>>> c.mapping

[1, 1, 1, 2, 2, 2, 2]

This means that the first 3 elements have been assigned to the first cluster, and the remaining 4 to the second cluster. We might wonder which are the most characteristic elements for both clusters:

>>> c.medoids

[3, 5]

So, the characteristic elements are the third ([1,1]) for the first cluster and the fifth ([5,5]) for the second. Note that [1,2] is the first element, not the zero-th!

We might also wonder, how tight the clusters are:

>>> c.cdisp

[0.82166492511320099, 0.51174486803519059]

We notice that the first cluster is tighter than the second. c.disp is the average cluster tightness. We can use this to check if we have the right number of clusters. Usually the greater is the average tightness, the better is the clustering.

There are two ways of calculating distance between points. One is Manthattan, which we have been using until now and is the default choice, for example d(A,B) = abs(Ax-Bx) + abs(Ay-By). The other is Euclidean, d(A,B) = sqrt(sqr(Ax-Bx) + sqr(Ay-By)). If you want to use Euclidean metric, call k-means clustering like this: MClustering(p,2,1).

Let’s see how the fuzzy clustering works on this data, but we are going to use Euclidean metric:

>>> c = FClustering(p,2,1)

>>> c.mapping

[1, 1, 1, 2, 2, 2, 2]

It’s the same. However, we can look at the cluster membership probabilities:

>>> for i in c.membership:

...   for j in i:

...         print "%2.2f"%j ,

...   print

...

0.92 0.92 0.94 0.49 0.10 0.06 0.12

0.08 0.08 0.06 0.51 0.90 0.94 0.88

The first row corresponds to the first cluster, and the second row to the second cluster. We notice that the 4th element is somewhere in between both clusters, with the probability only slightly greater for the second cluster.

A dissimilarity matrix is a slightly different representation. We explicitly list how different each pair of elements is. We can represent this with a matrix, but because distances are symmetric (A is just as different from B, as B is from A), and because an element is identical to itself, we need only write the bottom half of the dissimilarity matrix:

>>> m = [[1.0], [2.0, 4.0], [3.0, 5.0, 6.0]]

>>> for i in m:

...   for j in i:

...         print "%2.2f"%j ,

...   print

...

1.00

2.00 4.00

3.00 5.00 6.00

This means that the elements 1 and 2 are 1.00 points apart, while 3 and 4 are 6.00 points apart.

All three clustering methods MClustering, FClustering, and  HClustering support dissimilarities if you put the D prefix in front:  DMClustering, DFClustering, and  DHClustering. Then simply pass the dissimilarity matrix instead of the vector set. Because we have not looked at hierarchical clustering, we’ll study this one now.

Hierarchical clustering has several algorithms for estimating the distance between clusters. They are:

1. Average linkage: how different on average are all pairs of elements, the first element is from the first, and the second from the second cluster.
2. Single linkage: how different is the closest pair of neighboring elements
3. Complete linkage: how different are the elements from the most different pair of two clusters
4. Ward’s method: Ward's minimum variance linkage method attempts to minimize the increase in the total sum of squared deviations from the mean of a cluster. Uff.
5. Weighted linkage method: it is a derivative of average linkage method, but where both clusters are weighted equally in order to remove the influence of different cluster size. Ugh.

At each step, the hierarchical clustering method merges the closest pair of clusters. Anyways, Ward’s method is the default choice. If you wanted to use the fifth, weighted linkage, call like this c = DHCluster(m, 5).

So, let’s look at what we can find out from the hierarchical clustering object:

>>> c = DHClustering(m)

>>> c.ac

0.50030890075261991

AC tells us how well the data clusters. The greater the number, the better it is. The hierarchical clustering object contains all possible cluster numbers. So we need to tell, how many clusters we would like to have. If we want to get the mapping for 2 clusters, do like this:

>>> c.domapping(2)

>>> c.mapping

[1, 2, 2, 2]

We might want to know what number of clusters is best. In k-means clustering, we have to start the clustering algorithms for all possible numbers, and then find one with the biggest c.disp (although this is just a heuristic). With hierarchical clustering it’s far simpler. We will work with the earlier vector set data, and use Euclidean metric:

>>> p = [[1,2],[2,1],[1,1],[3,4],[5,5],[6,5],[7,6]]

>>> c = HClustering(p,1)

>>> c.height

[1.29, 1.00, 3.85, 10.00, 1.00, 2.08]

The order of the array is not obvious, but it’s not important. We are interested in the sorted array. We obtain it like this, by copying the array first not to corrupt the clustering object:

>>> th = [i for i in c.height]

>>> th.sort()

>>> th

[1.0, 1.0, 1.29, 2.08, 3.85, 10.00]

The numbers indicate how far the clusters were when we merged them. The first three acts of merging (up to four clusters) didn’t create any real distortion (1.29). The inter-cluster distance in two succeeding clusterings was quadratic, but that’s acceptable.  However, merging into a single cluster is very bad, because it involves a big jump (10.00). So there are approximately 2-4 clusters. A well-known heuristic is the knee-point of the graph of distance relative to height.

There is more detail to the hierarchical clustering object, and consult the reference documentation if you are interested.