Data mining: an overview

Prabhas Chongstitvatana

Lecture to master's degree students of the Computer Science Department, Faculty of Science, Kasetsart University, 19th July 2002

What is data mining
Machine learning
Tasks of data mining
Clustering
Genetic algorithm in clustering
Applications
How to do data mining (a case study)
Conclusion
References

What is data mining

Data mining refers to algorithms that can extract meaningful information from vast stores of data.

"Data mining is defined as the process of discovering patterns in data.  The process must be automatic or (more usually) semi-automatic.  The patterns discovered must be meaningful in that they lead to some advantage, usually economic advantage.  The data is invariably present in substantial quantities."
(Witten and Frank 2000)

It is about finding and describing patterns in data, and machine learning techniques are applied to do so.  The data takes the form of a set of examples.  The output takes the form of predictions on new examples.  Data mining can be applied to relational, transaction, and spatial databases, as well as to large stores of unstructured data such as the World Wide Web.

Machine learning

Machine learning can be defined broadly as the study of computer programs that improve their performance at some task through experience.

Definition: (Mitchell 1997)
A computer program is said to "learn" from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Machine learning algorithms are being used routinely to discover valuable knowledge from large commercial databases.  There are many learning algorithms in machine learning.

Learning algorithms usually proceed by taking an initial 'training set' of data and then 'learning' from this set.  The result is some kind of description that can be regarded as rules explaining the data.  This description is considered to be the information about the data.  It can be used to perform several tasks: classification, prediction, etc.

Tasks of data mining

The problems associated with data mining are statistical in nature.  The difference between current data mining techniques and the statistical tools we already know is the amount of data.  When the data is huge, it is necessary to 'automate' many aspects of data collection, preparation, and processing.  However, in the end, analysing the 'summary' of the data is still a human job.

As a multidisciplinary field, data mining draws from areas such as artificial intelligence, database theory, data visualization, marketing, mathematics, operations research, pattern recognition, and statistics.  What data mining is good for is discovering potential hypotheses prior to using statistical tools.  Predictive modeling uses clustering to group items, then infers rules to characterize the groups and suggest models.  For example, magazine subscribers can be clustered based on a number of factors (age, sex, income, etc.), and the resulting groups then characterized in an attempt to find a model which will distinguish those subscribers that will renew their subscriptions from those that will not.

The most common tasks of data mining are:

clustering is grouping similar data together (like categorization): which category does a data item belong to?  The categories can also be discovered automatically.

association is uncovering affinities among transaction records consisting of several variables: discovering rules of the form, if item X is part of a transaction, then for some percentage of the time, item Y is also part of the transaction (see the sketch after this list).

predictive modeling is either classifying data into one of several predefined categorical classes or using selected fields from historical data to predict target fields.
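
To make the association task concrete, here is a minimal sketch in Python that computes the support and confidence of a rule "if X, then Y" from a list of transactions.  The transactions and item names are made up for illustration; they are not from the lecture.

# Made-up market-basket transactions, each a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def rule_stats(transactions, x, y):
    """Support and confidence of the association rule x -> y."""
    n = len(transactions)
    with_x = [t for t in transactions if x in t]
    with_xy = [t for t in with_x if y in t]
    support = len(with_xy) / n                        # both items, over all transactions
    confidence = len(with_xy) / len(with_x) if with_x else 0.0
    return support, confidence

support, confidence = rule_stats(transactions, "bread", "milk")
print(f"bread -> milk: support={support:.2f}, confidence={confidence:.2f}")

Here the rule "if bread, then milk" holds 75% of the time (confidence), and both items appear together in 60% of all transactions (support).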

Clustering (Jain and others 1999)

The clustering task consists of:
-  data representation (plus feature extraction)
-  distance definition (similarity measure)
-  grouping
-  data abstraction (extracting a simple and compact representation from a data set)

Definition
pattern: X = (x1, x2, ..., xd), a vector of d measurements
xi: the i-th attribute
d: the dimensionality
pattern set: {X1, X2, ..., Xn}

The patterns to be clustered form an n x d pattern matrix.
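
As a minimal illustration in Python with numpy (the numbers are made up), the pattern set can be stored as an n x d matrix, and a distance measure such as Euclidean distance computed between rows:

import numpy as np

# n = 4 patterns, d = 2 attributes each: a 4 x 2 pattern matrix.
patterns = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 8.0],
    [9.0, 9.5],
])

def euclidean(a, b):
    """Euclidean distance, a common dissimilarity measure for numeric patterns."""
    return np.linalg.norm(a - b)

print(euclidean(patterns[0], patterns[1]))  # similar patterns: small distance
print(euclidean(patterns[0], patterns[2]))  # dissimilar patterns: large distance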

I will discuss a simple clustering algorithm.

squared error algorithms

    e(X, L) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\| X_i^{(j)} - c_j \right\|^2

where X_i^{(j)} is the i-th pattern belonging to the j-th cluster and c_j is the centroid of the j-th cluster.

k-means is the simplest algorithm ( O(n) ).  It starts with a random initial partition and keeps reassigning the patterns to clusters, based on the similarity between each pattern and the cluster centers, until convergence.

k-Means Clustering Algorithm

(1) Choose k cluster centers to coincide with k randomly-chosen patterns or k randomly defined points inside the hypervolume containing the pattern set.
(2) Assign each pattern to the closest cluster center.
(3) Recompute the cluster centers using the current cluster memberships.
(4) If a convergence criterion is not met, go to step 2.

Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.

drawbacks: sensitive to the initial seed, and can produce only hyperspherical clusters. A sketch of an implementation follows.
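
As a minimal sketch in Python with numpy (assuming numeric patterns stored in an n x d matrix as defined above), a straightforward implementation of steps (1)-(4):

import numpy as np

def k_means(patterns, k, seed=0):
    """Cluster an n x d pattern matrix into k clusters (steps 1-4 above)."""
    rng = np.random.default_rng(seed)
    # (1) choose k cluster centers to coincide with k randomly-chosen patterns
    centers = patterns[rng.choice(len(patterns), size=k, replace=False)].astype(float)
    labels = np.full(len(patterns), -1)
    while True:
        # (2) assign each pattern to the closest cluster center
        dists = np.linalg.norm(patterns[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # (4) convergence criterion: no reassignment of patterns
        if np.array_equal(new_labels, labels):
            return centers, labels
        labels = new_labels
        # (3) recompute each cluster center from its current members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = patterns[labels == j].mean(axis=0)

For example, k_means(patterns, k=2) on the 4 x 2 pattern matrix sketched earlier recovers the two visible groups; different seeds can give different partitions, which is the sensitivity to the initial seed noted above.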

Using evolutionary algorithms

I will give an example of using genetic algorithms in clustering (from Jain and others 1999).

An Evolutionary Algorithm for Clustering

(1) Choose a random population of solutions. Each solution here corresponds to a valid k-partition of the data. Associate a fitness value with each solution. Typically, fitness is inversely proportional to the squared error value. A solution with a small squared error will have a larger fitness value.
(2) Use the evolutionary operators selection, recombination and mutation to generate the next population of solutions. Evaluate the fitness values of these solutions.
(3) Repeat step 2 until some termination condition is satisfied.

A chromosome represents a partition of N objects into K clusters and is represented by a K-ary string of length N. For example, consider six patterns—A, B, C, D, E, and F—and the string 101001. This six-bit binary (K = 2) string corresponds to placing the six patterns into two clusters:  [A,C,F] [B,D,E]
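
A minimal sketch in Python with numpy of decoding such a string and computing the squared-error-based fitness from step (1); the six pattern coordinates are made up for illustration:

import numpy as np

def fitness(chromosome, patterns):
    """Decode a K-ary string into clusters; fitness = 1 / squared error."""
    labels = np.array([int(g) for g in chromosome])
    error = 0.0
    for j in np.unique(labels):
        members = patterns[labels == j]
        centroid = members.mean(axis=0)
        error += np.sum(np.linalg.norm(members - centroid, axis=1) ** 2)
    return 1.0 / error if error > 0 else float("inf")

# Six patterns A..F (made-up coordinates); "101001" -> [A,C,F] [B,D,E].
patterns = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.8],
                     [5.5, 4.5], [4.8, 5.2], [0.9, 1.1]])
print(fitness("101001", patterns))  # a good partition: small error, large fitness

A GA would evolve a population of such strings with selection, recombination, and mutation, favouring strings with larger fitness (smaller squared error).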

GA can be used in a hybrid with other algorithms such as k-means: GA is used only to find good initial cluster centers, and the k-means algorithm is then applied to find the final partitions.

Applications (Kobayashi and Takeda 2000, Sebastiani 2002)

A vast amount of data is now on the web.  The Internet grew exponentially in the past and will continue to do so in the coming decade (400 million sites in 2000, doubling every year).  85% of users are not satisfied with the current search engines: they are slow, and the results are of poor quality (noise, broken links, etc.).  90% of all web traffic is spread over 100,000 sites, with 50% headed towards the top 900 most popular sites (1998).

To perform data mining on such a large scale, there are several concerns: indexing, clustering, and knowledge management.

indexing

clustering: grouping similar documents together; similarity measures: document vs. query, document vs. document (a cosine-similarity sketch follows this list); identify attributes and weights
  - hierarchical
  - non-hierarchical
word-intersection clustering: O(n^2)
phrase-intersection clustering: O(n log n)
suffix tree clustering

ranking algorithms: the vector space model; authorities (content-rich pages) and hubs
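
For instance, a minimal sketch in Python of the vector space model's cosine similarity between a document and a query, both represented as hypothetical term-frequency vectors:

import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (dicts: term -> count)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

document = {"data": 3, "mining": 2, "web": 1}   # made-up term frequencies
query = {"web": 1, "mining": 1}
print(cosine_similarity(document, query))       # closer to 1 means more similar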

knowledge management
intelligent and adaptive web services:
- metasearch: combining several search engines
- internet shopping: auctions
- multimedia retrieval: images (query by image similarity: maps, fingerprints, faces), video, cross-language retrieval

knowledge extracted from texts:
- assigning texts to a predefined set of categories
- identifying the set of categories
- identifying the set of categories and grouping documents under them

automatic indexing: automatic metadata generation
document organization
text filtering
word sense disambiguation (the 'bank' of England vs. the bank of the Thames): spelling correction, part-of-speech tagging, word choice selection
hierarchical categorization of web pages

How to do data mining

I will give an example of the process of doing data mining. Cabena and others (1998) proposed a five-stage process for data mining:
  1. business objectives determination,
  2. data preparation,
  3. data mining,
  4. results analysis,
  5. knowledge assimilation.
Business objectives determination is concerned with clearly identifying the business problem to be mined; data preparation involves data selection, pre-processing and transformation; data mining is concerned with algorithm
selection and execution; results analysis is concerned with the question of whether anything new or interesting has been found; and finally, knowledge assimilation is concerned with formulating ways to exploit new information.

This model has some obvious shortcomings: namely, it is not based on any principles or existing body of research, and it is not supported by either quantitative or qualitative research.

Case study of TAKCO (Hirji 2001)

TAKCO is a recognized leader in the Canadian fast-food industry. Some interesting aspects of the fast-food industry that were evidenced by project team member comments are:
1) it is consumer driven,
2) firms strive toward operational efficiency and
3) there is an orientation toward extensive marketing analysis to understand and influence consumer choices.

Data was carefully collected, categorized (workshop notes, project documentation deliverables, data models, competitor analysis reports, and so forth), and analyzed after each site visit. The primary data collection method was direct observation; extensive notes were taken during each site visit and at all meetings. A total of 10 site visits took place; data collection began in July 1998 and was completed by November 1998.

During the course of the data mining project, three discontinuity points emerged: anticipation and anxiety at stage 1, and frustration at stage 4.

Anticipation refers to TAKCO project team member expectations about the potential of data mining to deliver novel and interesting findings. This caused unrealistic expectations about what the data mining results would be. The following illustrates the point:

"...lead to identification of new product bundles so that the lunch time menu could be revolutionized. "

Correction: establishing and reaching consensus on a realistic, measurable, and achievable management business goal and project goal.

Anxiety/Apprehension is the second project discontinuity point that also occurred during the business objectives determination stage. Both the technical and business members of the project team expressed concerns about the nature of the data preparation stage and the potential bias and noise that might be introduced into the data mining data set.

Correction: a data audit stage was therefore added after data preparation to demonstrate the validity, reliability, consistency, and integrity of the resulting transformed data set to be mined.

Frustration, as a project discontinuity point, occurred during the interactive data mining and results analysis stage. The comment "I already know that" was made frequently by the Research Supervisor in response to the presented data mining results.

Correction: back end data mining, which involved data enrichment and additional data mining algorithm execution by the data mining specialist, was introduced as a separate stage. The intent of this additional stage was to increase the dimensionality of the data mining data set with third-party demographic data and then to have the data mining specialist perform additional data mining off-site.

Discussion

1) A data mining project appears to follow a more elaborate set of stages than has been previously reported:
  1. Business Objectives Determination,
  2. Data Preparation,
  3. Data Audit,
  4. Interactive Data Mining and Results Analysis,
  5. Back End Data Mining,
  6. Results Synthesis and Presentation.
2) 45% of the total project effort was consumed by data mining analysis (Interactive Data Mining and Results Analysis, and Back End Data Mining), whereas only 30% was consumed by Data Preparation.

Conclusion

Data mining is an exciting multidisciplinary field of research which has many extremely useful applications.  The current research direction of data mining is the development of algorithms that can learn regularities in rich, mixed-media data.

References

Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., and Zanasi, A. (1998). Discovering Data Mining: From Concept to Implementation. Prentice Hall.

Hirji, K. K. (2001). Exploring data mining implementation. Communications of the ACM, 44(7), 87-93.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264-323.

Kobayashi, M. and Takeda, K. (2000). Information retrieval on the web. ACM Computing Surveys, 32(2), 144-173.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.

Witten, I. H. and Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
Last update 21st June 2002