Lecture to master's students of the Department of Computer Science, Rangsit University, 30 June 2002
I shall focus on a part of data mining called "data clustering". For a broad overview of data mining, please see my "data mining: an overview".
task of data clustering
definition of pattern
square error algorithms
k-means clustering algorithm
evolutionary clustering algorithm
incremental clustering
definition of category utility
incremental clustering algorithm
references
Data mining is good for discovering potential hypotheses before applying statistical tools. Predictive modeling uses clustering to group items, then infers rules to characterise the groups and suggest models.
The most common tasks of data mining are:
association uncovers affinities among transaction records consisting of several variables; it discovers rules of the form: if item X is part of a transaction, then for some percentage of the time, item Y is also part of the transaction.
predictive modeling either classifies data into one of several predefined categorical classes, or uses selected fields from historical data to predict target fields.
              |
              | pattern
              v
   1 [feature selection/extraction]
              |
              | pattern representation
              v
   2 [interpattern similarity]
              |
              |
              v
   3 [grouping]
              |
              | clusters
              v
    (feedback to steps 1 and 2)
The patterns to be clustered are represented as an n x d pattern matrix (n patterns, each described by d features).
I will discuss a simple clustering algorithm.
notation: m^n = m to the power of n (m superscript n), x_i = x subscript i, n_i = n subscript i, ||x|| = the norm of x
e^2(X, L) = sum j=1 to k, sum i=1 to n_j || x_i^(j) - c_j ||^2

where x_i^(j) is the i-th pattern belonging to the j-th cluster and c_j is the centroid of the j-th cluster.
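As a concrete illustration, here is a minimal sketch in Python (the function name squared_error and the use of numpy are my own choices, not part of the lecture) that computes this criterion for a given partition:

    import numpy as np

    def squared_error(clusters, centroids):
        # clusters: list of k sequences, each holding the patterns of one cluster
        # centroids: list of k centroid vectors c_j
        total = 0.0
        for patterns, c in zip(clusters, centroids):
            for x in patterns:
                diff = np.asarray(x, dtype=float) - np.asarray(c, dtype=float)
                total += float(diff @ diff)  # || x_i^(j) - c_j ||^2
        return total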
K-means is the simplest such algorithm.
(Its running time complexity is O(nkl), where n is the number of
patterns, k is the number of clusters, and l is the number of
iterations. Since k and l can be treated as constants, the algorithm
is linear in the size of the data set, which is very good!)
It starts with a random initial partition and keeps reassigning patterns to clusters, based on the similarity between each pattern and the cluster centers, until it converges.
Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or a minimal decrease in squared error.
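To make the procedure concrete, here is a minimal k-means sketch in Python (the naming and the numpy-based implementation are mine, not code from the lecture); X is an n x d pattern matrix as above, and init allows supplying initial centers:

    import numpy as np

    def k_means(X, k, init=None, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # random initial partition: k randomly chosen patterns as centers
        centers = (X[rng.choice(len(X), size=k, replace=False)]
                   if init is None else np.asarray(init, dtype=float))
        for _ in range(max_iter):
            # reassign each pattern to its nearest center
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # recompute each center as the mean of its assigned patterns
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):  # converged: no center moved
                break
            centers = new_centers
        return labels, centers

For example, with X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 8.5]]) (a 4 x 2 pattern matrix), k_means(X, 2) separates the two low patterns from the two high ones.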
drawbacks: it is sensitive to the initial seed, and it can produce only hyperspherical clusters.
Most improvements of this simple algorithm aim at speed. The algorithm
is iterative and is repeated until convergence, so it can be very time
consuming.
A chromosome represents a partition of N objects into K clusters and is represented by a K-ary string of length N. For example, consider six patterns—A, B, C, D, E, and F—and the string 101001. This six-bit binary (K = 2) string corresponds to placing the six patterns into two clusters: [A,C,F] [B,D,E]
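As a small illustration (the function name decode_chromosome is my own), such a string can be decoded into a partition as follows:

    def decode_chromosome(chromosome, patterns):
        # position i of the K-ary string gives the cluster label of patterns[i]
        clusters = {}
        for pattern, gene in zip(patterns, chromosome):
            clusters.setdefault(gene, []).append(pattern)
        return list(clusters.values())

    decode_chromosome("101001", ["A", "B", "C", "D", "E", "F"])
    # -> [['A', 'C', 'F'], ['B', 'D', 'E']]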
GA can be used in a hybrid with other algorithms such as k-means: the GA is used only to find good initial cluster centers, and the k-means algorithm is then applied to find the final partitions.
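A hedged sketch of this hybrid, reusing the k_means and decode_chromosome sketches above: decode the best chromosome the GA found, turn its clusters into initial centers, and let k-means finish the job.

    import numpy as np

    # suppose the GA's best chromosome for the patterns in X is "101001"
    best = decode_chromosome("101001", list(X))
    # centroids of the GA clusters become the initial centers for k-means
    init = np.array([np.mean(c, axis=0) for c in best])
    labels, centers = k_means(X, k=2, init=init)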
The CU measures the similarity of an item to all existing clusters,
and it is used to decide whether the new item should
1) share the same root,
2) split off to form a new root, or
3) split off to form a new root and merge some existing nodes into it.
We will start with the definition of CU:
CU(C_1, C_2, ..., C_k) =
    1/k * sum_l Pr[C_l] * ( sum_i sum_j ( Pr[a_i = v_ij | C_l]^2 - Pr[a_i = v_ij]^2 ) )
where C_1, C_2, ..., C_k are the k clusters and the outer sum runs over
all clusters, weighted by Pr[C_l]; a_i is the i-th attribute, and it
takes the values v_i1, v_i2, .... The term Pr[a_i = v_ij | C_l] is the
probability that the attribute takes this value given that the data item
has been assigned to C_l. The term Pr[a_i = v_ij] is the prior
probability that the attribute takes the value, regardless of cluster.
The difference of these two terms measures how much is gained by
assigning a data item to a cluster: knowing the correct cluster should
make it easier to predict the value of this attribute. This number
should be high when the clustering is good.
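To make the formula concrete, here is a minimal sketch for nominal attributes (the naming and the data layout are my own choices: each item is a dict mapping attribute name to value, and a cluster is a list of items):

    from collections import Counter

    def category_utility(clusters):
        items = [x for cluster in clusters for x in cluster]
        n = len(items)
        attrs = items[0].keys()
        # sum over i, j of Pr[a_i = v_ij]^2
        base = sum((cnt / n) ** 2
                   for a in attrs
                   for cnt in Counter(x[a] for x in items).values())
        cu = 0.0
        for cluster in clusters:
            p_cl = len(cluster) / n  # Pr[C_l]
            # sum over i, j of Pr[a_i = v_ij | C_l]^2
            cond = sum((cnt / len(cluster)) ** 2
                       for a in attrs
                       for cnt in Counter(x[a] for x in cluster).values())
            cu += p_cl * (cond - base)
        return cu / len(clusters)  # the 1/k factor

For example, category_utility([[{'color': 'red'}, {'color': 'red'}], [{'color': 'blue'}]]) is positive (2/9) because knowing the cluster makes the color fully predictable.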
Since zero variance on a numeric attribute produces an infinite value in
CU, a minimum variance must be imposed on the calculation of CU; this
minimum is called the 'acuity'. To control the growth of the tree, a
second numeric parameter, called the 'cutoff', is used: when adding a
new node does not produce a sufficient increase in CU, that node is cut
off.
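In the numeric variant of CU the acuity is simply a floor on the spread of an attribute; a one-line sketch (the name clipped_std and the use of statistics.pstdev are my own choices):

    import statistics

    def clipped_std(values, acuity):
        # never let the standard deviation fall below the acuity,
        # so the per-attribute terms in the numeric CU stay finite
        return max(statistics.pstdev(values), acuity)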
(see detailed example in Witten p.213, 216)