Data clustering

Prabhas Chongstitvatana

lecture to master students of the Department of Computer Science, Rangsit University, 30 June 2002

I shall focus on a part of data mining called "data clustering".  For a broad overview of data mining please see my "data mining: an overview".

task of data clustering
definition of pattern
square error algorithms
k-means clustering algorithm
evolutionary clustering algorithm
incremental clustering
definition of category utility
incremental clustering algorithm
references

What is data mining

(A brief definition)
Data mining is a collection of algorithms that can extract meaningful information from vast stores of data.

Data mining is good for discovering potential hypotheses prior to using statistical tools.  Predictive modeling uses clustering to group items, then infers rules to characterise the groups and suggest models.

The most common tasks of data mining are:

clustering is to group similar data together (like categorization): which category does a data item belong to?  The categories can also be discovered automatically.

association is to uncover affinities among transaction records consisting of several variables, i.e. to discover rules of the form: if item X is part of a transaction, then for some percentage of the time, item Y is also part of the transaction.

predictive modeling is either to classify data into one of several predefined categorical classes or to use selected fields from historical data to predict target fields.

Data clustering

task
-  data representation (+ feature extraction)
-  distance definition (similarity measure)
-  grouping
-  data abstraction (extracting simple and compact representation from a data set)

     |
     | pattern
     v
1 [feature selection/extraction]
     |
     | pattern representation
     v
2 [interpattern similarity]
     |
     |
     v
  [grouping]
     |
     | clusters
     v
  feedback to steps 1 and 2
 

Definition

pattern  X = {x1, x2, ..., xd}, a vector of d measurements
xi  the i-th attribute
d   the dimensionality
pattern set  {X1, X2, ..., Xn}

the n patterns to be clustered form an n x d pattern matrix
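
As a small illustration (the numbers here are made up, not from the lecture), four 2-dimensional patterns can be stored as a 4 x 2 pattern matrix, for example in Python:

import numpy as np

# hypothetical pattern matrix: n = 4 patterns, d = 2 attributes
# row i is pattern Xi = (x1, x2)
X = np.array([[1.0, 2.0],    # X1
              [1.5, 1.8],    # X2
              [5.0, 8.0],    # X3
              [6.0, 9.0]])   # X4
n, d = X.shape               # n = 4, d = 2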

I will discuss a simple clustering algorithm.

notation:   m^n  =  m to the power of n or m superscript n,  xi  = x subscript i,  n_i  = n subscript i,  || x ||  = norm of x

square error algorithms

The square error (e^2) of the pattern matrix X for a clustering L (containing k clusters) is:

e^2(X, L) = sum_{j=1 to k} sum_{i=1 to n_j} || X_i^(j) - c_j ||^2

where X_i^(j) is the i-th pattern belonging to the j-th cluster, c_j is the centroid of the j-th cluster, and n_j is the number of patterns in the j-th cluster
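
As a sketch (not part of the original notes), the squared error can be computed directly from the pattern matrix, a cluster label for each pattern and the cluster centroids; the function name is only illustrative:

import numpy as np

def squared_error(X, labels, centroids):
    # X: n x d pattern matrix
    # labels[i]: index of the cluster that pattern i belongs to
    # centroids: k x d matrix of cluster centers
    e2 = 0.0
    for i in range(len(X)):
        diff = X[i] - centroids[labels[i]]
        e2 += diff.dot(diff)          # || X_i^(j) - c_j ||^2
    return e2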

k-means is the simplest algorithm
(its running time complexity is O(nkl), where n is the number of patterns, k is the number of clusters and l is the number of iterations.  k and l can be treated as constants, hence the algorithm is linear in the size of the data set, very good!)

It starts with a random initial partition and keeps reassigning the patterns to clusters, based on the similarity between each pattern and the cluster centers, until it converges.

k-Means Clustering Algorithm

(1) Choose k cluster centers to coincide with k randomly-chosen patterns or k randomly defined points inside the hypervolume containing the pattern set.
(2) Assign each pattern to the closest cluster center.
(3) Recompute the cluster centers using the current cluster memberships.
(4) If a convergence criterion is not met, go to step 2.

Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.
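
The four steps above translate almost directly into code.  The following is a minimal sketch in Python using numpy (not an optimised implementation); it initialises the centers with randomly chosen patterns, as in step 1, and stops when no pattern is reassigned:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step (1): choose k randomly chosen patterns as the initial centers
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # step (2): assign each pattern to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step (4): converged when no pattern is reassigned
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step (3): recompute each center from its current members
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids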

drawback: sensitive to the initial seed, can produce only hyperspherical clusters

Most improvements of this simple algorithm aim at speed.  The algorithm is iterative; it is repeated until convergence, so it can be very time consuming.
 

Using Evolutionary algorithms

I will give an example of using genetic algorithms in clustering.
(from Jain and others 1999)

An Evolutionary Algorithm for Clustering

(1) Choose a random population of solutions. Each solution here corresponds to a valid k-partition of the data. Associate a fitness value with each solution. Typically, fitness is inversely proportional to the squared error value. A solution with a small squared error will have a larger fitness value.
(2) Use the evolutionary operators selection, recombination and mutation to generate the next population of solutions. Evaluate the fitness values of these solutions.
(3) Repeat step 2 until some termination condition is satisfied.

A chromosome represents a partition of N objects into K clusters and is represented by a K-ary string of length N. For example, consider six patterns—A, B, C, D, E, and F—and the string 101001. This six-bit binary (K = 2) string corresponds to placing the six patterns into two clusters:  [A,C,F] [B,D,E]
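
As a sketch of how such a chromosome could be decoded and scored (the function names are mine, not from Jain et al.), the string is read as one cluster index per pattern, and the fitness follows the rule above that it is inversely proportional to the squared error, reusing the squared_error function sketched earlier.  The sketch assumes the chromosome encodes a valid k-partition, i.e. no cluster is empty:

import numpy as np

def decode(chromosome, X, k):
    # chromosome: string such as "101001", one digit per pattern,
    # where the digit is the cluster index of that pattern
    labels = np.array([int(c) for c in chromosome])
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

def fitness(chromosome, X, k):
    labels, centroids = decode(chromosome, X, k)
    # a small squared error gives a large fitness value
    return 1.0 / (1.0 + squared_error(X, labels, centroids))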

GA can be used in hybrid with other algorithms such as k-means.  GA is used only to find good initial cluster centers, and the k-means algorithm is then applied to find the final partitions.

Incremental clustering (Witten 2000, p.212)

The k-means algorithm iterates over the whole data set until convergence and gives a hard clustering, i.e. it labels each data item with a cluster number, and the number of clusters is given at the beginning.  In contrast, the next clustering algorithm builds up the clustering hierarchy incrementally.  Its output is a tree of data items in which the items of each cluster share a common node.  The shape of the tree is determined by a measure called 'category utility' (CU).

The CU measures the similarity of an item to all existing clusters, and it is used to decide whether to
1) share the same root,
2) split to form a new root, or
3) split to form a new root and merge some existing nodes into it.

We will start with the definition of CU:

CU(C1, C2, ..., Ck) =
1/k sum_l Pr[C_l] sum_i sum_j ( Pr[a_i = v_ij | C_l]^2 - Pr[a_i = v_ij]^2 )

where C1, C2, ..., Ck are the k clusters, and the outer sum 'sum_l Pr[C_l] ...' runs over all clusters, each weighted by its probability Pr[C_l].  a_i is the i-th attribute and it takes the values v_i1, v_i2, ...  The term Pr[a_i = v_ij | C_l] is the probability that the attribute takes this value given that the data item has been assigned to C_l.  The term Pr[a_i = v_ij] is the probability that the attribute takes this value regardless of cluster.  The difference of these two terms measures how much is gained by assigning a data item to a cluster.  In other words, knowing the correct cluster should make it more likely to predict the value of this attribute.  This number should be high when the clustering is good.
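
To make the definition concrete, here is a sketch of how CU could be computed for nominal attributes.  Each cluster is taken to be a list of data items and each item a dictionary mapping attribute names to values; this representation is an assumption of the sketch, not Witten's:

from collections import Counter

def category_utility(clusters):
    # clusters: list of clusters, each cluster a list of items,
    # each item a dict {attribute: nominal value}
    items = [x for c in clusters for x in c]
    n = len(items)
    attrs = list(items[0].keys())
    # Pr[a_i = v_ij] over the whole data set
    overall = {a: Counter(x[a] for x in items) for a in attrs}
    cu = 0.0
    for c in clusters:
        p_c = len(c) / n                                  # Pr[C_l]
        within = {a: Counter(x[a] for x in c) for a in attrs}
        gain = 0.0
        for a in attrs:
            for count in within[a].values():
                gain += (count / len(c)) ** 2             # Pr[a_i = v_ij | C_l]^2
            for count in overall[a].values():
                gain -= (count / n) ** 2                  # Pr[a_i = v_ij]^2
        cu += p_c * gain
    return cu / len(clusters)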

Now we can show the incremental clustering algorithm:

initialise the tree: begin with one data item as the root node.
1) for each data item, calculate CU by supposing that the item is placed at the existing level, and evaluate the CU of the parent node.  If it is good, place it there.
2) if it is not good, consider merging it with existing nodes.  To merge, take two existing nodes, the best and the runner-up, then evaluate whether the merge is beneficial (it improves CU after merging).  If so, merge.
3) if not, consider splitting.  Splitting a node raises its children up one level.
4) repeat 1 to 3 until all data has been used.
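
As a much reduced sketch of the placement decision in step 1, reusing the category_utility function above: the item is tentatively added to each existing cluster and also tried as a new cluster of its own, and the arrangement with the highest CU is kept.  Merging and splitting (steps 2 and 3) and the recursion down the tree are omitted:

def place(children, item):
    # children: list of existing clusters (each a list of items) under one node
    best_cu, best_choice = None, None
    # try adding the item to each existing cluster
    for i in range(len(children)):
        trial = [c + [item] if j == i else c for j, c in enumerate(children)]
        cu = category_utility(trial)
        if best_cu is None or cu > best_cu:
            best_cu, best_choice = cu, trial
    # also try starting a new cluster containing only the item
    trial = children + [[item]]
    cu = category_utility(trial)
    if best_cu is None or cu > best_cu:
        best_cu, best_choice = cu, trial
    return best_choice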

Since zero variance of an attribute produces an infinite value of CU, a minimum variance of each attribute must be imposed in the calculation of CU.  This minimum is called the 'acuity'.  To control the growth of the tree, a second numeric value, called the 'cutoff', is used: when adding a new node does not produce a large enough increase in CU, that node is cut off.
(see detailed example in Witten p.213, 216)

References

I. Witten and E. Frank, Data Mining, Morgan Kaufmann, 2000.
A. Jain, M. Murty and P. Flynn, "Data clustering: a review", ACM Computing Surveys, vol. 31, no. 3, September 1999.