lecture to master degree students of the computer science department, faculty of science, Kasetsart university, 19th July 2002
What is data mining
Machine learning
Tasks of data mining
Clustering
Genetic algorithm in clustering
Applications
How to do data mining (a case study)
Conclusion
References
"Data mining is defined as the process of discovering patterns in data.
The process must be automatic or (more usually) semi-automatic. The
patterns discovered must be meaningful in that they lead to some advantage,
usually economic advantage. The data is invariably present in substantial
quantities."
(Witten and Frank 2000)
It is about finding and describing patterns in data. Machine learning technique is applied. The data takes the form of a set of examples. The output takes the form of predictions on new examples. Data mining can be applied to relational, transaction, and spatial databases, as well as large stores of unstructured data such as the World Wide Web.
Definition: (Mitchell 1997)
A computer program is said to "learn" from experience E with respect
to some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E.
Machine learning algorithms are being used routinely to discover valuable knowledge from large commercial databases. There are many learning algorithms in machine learning:
As a multidisciplinary field, data mining draws from areas such as artificial intelligence, database theory, data visualization, marketing, mathematics, operations research, pattern recognition, and statistics. What Data mining is good for is to discover potential hypothesis prior to using statistical tools. Predictive modeling uses clustering to group items, then infers rules to characterise the groups and suggest models. For example, magazine subscribers can be clustered based on a number of factors (age, sex, income, etc.), then the resulting groups characterized in an attempt to find a model which will distinguish those subscribers that will renew their subscriptions from those that will not.
The most common tasks of data mining are:
association is to uncover affinities among transaction records consisting of several variables, discover rules of the form: if item X is part of a transaction, then for some percent of the time, item Y is part of the transaction.
predictive modeling is either to classify data into one of several predefined categorical classes or to use selected fields from historical data to predict target fields.
Definition
pattern X = {x1, x2, . . xd } d measurement
xi attribute
d dimensionality
pattern set { X1, X2, ... Xn}
pattern to be clustered is n x d pattern matrix
I will discuss a simple clustering algorithm.
where Xi^(j) is the i-th pattern belonging to the j-th cluster, cj is the centroid of the j-th cluster
k-means is the simplest algorithm ( big-oh(n) ). Starting with random initial partition and keeps reassigning the patterns to clusters based on the similarity between the pattern and the cluster centers until converge.
Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.
drawback: sensitive to initial seed, can produce only hypersherical clusters
A chromosome represents a partition of N objects into K clusters and is represented by a K-ary string of length N. For example, consider six patterns—A, B, C, D, E, and F—and the string 101001. This six-bit binary (K = 2) string corresponds to placing the six patterns into two clusters: [A,C,F] [B,D,E]
GA can be used in hybrid with other algorithms such as k-means. GA is used only to find good initial cluster centers and the k-means algorithm is applied to find the final partions.
To perform data mining on such large scale there are several concerns; indexing, clustering, knowledge management
indexing
clustering: grouping similar documents together, similarity measure:
document vs query, document vs document
identify attributes, weights
- hierarchical
- non-hierarchical
word-intersection clustering big-oh(n^2)
phrase-intersection clustering big-oh(n log n)
suffix tree clustering
ranking algorithm: vector space model, authority -- content rich, hub
knowledge management
intelligent and adaptive web services:
metasearch -- combine several search engines
internet shopping: auction
multimedia retrieval: image -- query by image similarity, map, fingerprint,
faces, video, cross-language
knowledge extracted from texts
- assignment of texts to a predefined set of categories
- identify set of categories
- identify set of categories and grouping documents under them
automatic indexing: autmatic metadata generation
document organization
text filtering
word sense disambiguation (bank of england, bank of thames): spelling
correction, part of speech tagging, word choice selection
hierarchical categorization of web pages
This model has some obvious shortcomings: namely, it is not based on any principles or existing body of research, and it is not supported by either quantitative or qualitative research.
Data was carefully collected, categorized (workshop notes, project documentation deliverables, data models, competitor analysis reports, and so forth), and analyzed after each site visit. The primary data collection method was direct observation, extensive notes were taken during each site visit and at all meetings. A total of 10 site visits took place and data collection began in July 1998 and was completed by November 1998.
During the course of the data mining project, three discontinuity points emerged; anticipation, anxiety at stage 1 and frustration at stage 4.
Anticipation refers to TAKCO project team member expectations about the potential of data mining to deliver novel and interesting findings. This causes unrealistic expectations about what the data mining results would be. The following illustrates the point:
"...lead to identification of new product bundles so that the lunch time menu could be revolutionized. "
correction: establishing and reaching consensus on a realistic, measurable, and achievable management business goal and project goal.
Anxiety/Apprehension is the second project discontinuity point that also occurred during the business objectives determination stage. Both the technical and business members of the project team expressed concerns about the nature of the data preparation stage and the potential bias and noise that might be introduced into the data mining data set.
correction: A data audit stage was therefore added after data preparation to demonstrate the validity, reliability, consistency, and integrity of the resulting transformed data set to be mined.
Frustration, as a project discontinuity point, occurred during the interactive data mining and results analysis stage. "I already know that" comment was made frequently by the Research Supervisor in response to presented data mining results.
correction: Back end data mining, which involved data enrichment and additional data mining algorithm execution by the data mining specialist, was introduced as a separate stage. The intent of this additional stage was to increase the dimensionality of the data mining data set with third-party demographic data and then to have the data mining specialist perform additional data mining off-site.
Last update 21st June 2002