Module 2: Introduction to Data Mining, Data Exploration and Data Pre-processing
1. Introduction to Data Mining
1.1 What is Data Mining?
Definition 1: The efficient discovery of previously unknown, valid, potentially useful, and understandable patterns in large datasets.
Definition 2: The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
1.2 Database Processing vs. Data Mining
| Aspect | Database Processing | Data Mining |
|---|---|---|
| Query | Well-defined | Poorly defined |
| Language | SQL | No precise query language |
| Output | Subset of database | Not a subset of database |
1.3 Query Examples Comparison
Database Queries:
- Find all customers who have purchased milk
- Find all credit applicants with last name Smith
- Identify customers who have spent more than $10,000 in the last month
Data Mining Queries:
- Find all items frequently purchased with milk (Association Rules)
- Find all credit applicants who are poor credit risks (Classification)
- Identify customers with similar buying habits (Clustering)
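The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustrative sketch: the toy `transactions` data, the function names, and the `min_fraction` cutoff are all assumptions made for the example, not part of the course material. The first function answers an exact question and returns a subset of the stored data; the second has to choose what "frequently" means and returns derived information that is not stored anywhere.

```python
from collections import Counter

# Hypothetical toy data: (customer, set of items in one basket)
transactions = [
    ("alice", {"milk", "bread", "eggs"}),
    ("bob", {"milk", "bread"}),
    ("carol", {"beer", "chips"}),
    ("dave", {"milk", "eggs", "bread"}),
]

def customers_who_bought(item, transactions):
    """Database-style query: well-defined, output is a subset of the data."""
    return sorted({cust for cust, items in transactions if item in items})

def items_bought_with(item, transactions, min_fraction=0.5):
    """Mining-style query: 'frequently' is not precise, so a cutoff
    (min_fraction, an assumption of this sketch) must be chosen.
    Returns each co-purchased item with the fraction of baskets
    containing `item` in which it also appears."""
    baskets = [items for _, items in transactions if item in items]
    if not baskets:
        return {}
    counts = Counter(
        other for items in baskets for other in items if other != item
    )
    return {
        other: n / len(baskets)
        for other, n in counts.items()
        if n / len(baskets) >= min_fraction
    }

print(customers_who_bought("milk", transactions))
print(items_bought_with("milk", transactions, min_fraction=0.7))
```

Changing `min_fraction` changes the answer to the mining query, whereas the database query has exactly one correct answer. This mirrors the "well-defined" versus "poorly defined" row in the comparison of section 1.2.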
1.4 Key Terminology
- Data (D): A set of facts (items), usually stored in a database
- Pattern (E): An expression in a language L that describes a subset of facts
- Attribute: A field in an item i in D
- Interestingness (I): A function that maps an expression E in L into a measure space M
The Data Mining Task:
Given a dataset D, a language L, an interestingness function I_{D,L}, and a threshold c, efficiently find the expressions E in L such that I_{D,L}(E) > c.
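One possible concrete reading of this task, as a minimal sketch: take D to be a list of market baskets, let the language L be item sets of bounded size, and use support (the fraction of baskets containing the item set) as the interestingness function. These choices, the toy data, and the names `support`, `mine`, and `max_size` are assumptions made for illustration. The enumeration below is brute force, so it covers the "find E with I(E) > c" part of the definition but not the "efficiently" part; practical algorithms prune the search space.

```python
from itertools import combinations

# Hypothetical dataset D: each fact is one basket of items
D = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "eggs", "bread"},
]

def support(E, D):
    """Interestingness I_{D,L}(E): fraction of baskets in D containing all of E."""
    return sum(1 for basket in D if E <= basket) / len(D)

def mine(D, c, max_size=2):
    """Enumerate every expression E in L (item sets up to max_size items)
    and keep those whose interestingness exceeds the threshold c."""
    items = sorted(set().union(*D))
    found = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            E = frozenset(combo)
            score = support(E, D)
            if score > c:
                found[E] = score
    return found

patterns = mine(D, c=0.5)
for E, score in patterns.items():
    print(sorted(E), score)
```

Here the measure space M is the interval [0, 1], and raising or lowering c directly controls how many patterns are reported.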
2. Data Mining Task Primitives
A data mining task is specified as a data mining query, which is defined in terms of task primitives. These primitives let the user communicate interactively with the system during the mining process.
2.1 Five Key Task Primitives