Module 2: Introduction to Data Mining, Data Exploration and Data Pre-processing

1. Introduction to Data Mining

1.1 What is Data Mining?

Definition 1: The efficient discovery of previously unknown, valid, potentially useful, and understandable patterns in large datasets.

Definition 2: The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

1.2 Database Processing vs. Data Mining

Aspect	Database Processing	Data Mining
Query	Well-defined	Poorly defined
Language	SQL	No precise query language
Output	Subset of database	Not a subset of database

1.3 Query Examples Comparison

Database Queries:

Find all customers who have purchased milk
Find all credit applicants with last name Smith
Identify customers who have spent more than $10,000 in the last month

Data Mining Queries:

Find all items frequently purchased with milk (Association Rules)
Find all credit applicants who are poor credit risks (Classification)
Identify customers with similar buying habits (Clustering)
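
To make the contrast concrete, here is a minimal Python sketch. The transaction data, the function names, and the 0.6 confidence threshold are all hypothetical and chosen only for illustration. The first function behaves like a database query: its answer is a subset of the stored records. The second behaves like a simple association-style mining query: its answer is a derived pattern that is not stored anywhere in the data.

```python
from collections import Counter

# Hypothetical toy data: each transaction is (customer, set of items bought).
transactions = [
    ("ann", {"milk", "bread", "eggs"}),
    ("bob", {"milk", "bread"}),
    ("cat", {"beer", "chips"}),
    ("dan", {"milk", "cereal", "bread"}),
    ("eve", {"bread", "eggs"}),
]

def customers_who_bought(item):
    """Database-style query: well-defined, and the answer is a subset of the data."""
    return sorted({cust for cust, items in transactions if item in items})

def items_bought_with(item, min_confidence=0.6):
    """Mining-style query: derive co-purchase patterns that are not stored explicitly.

    Returns {other_item: confidence}, where confidence is the fraction of
    baskets containing `item` that also contain `other_item`.
    The 0.6 default threshold is an arbitrary illustrative choice.
    """
    baskets = [items for _, items in transactions if item in items]
    if not baskets:
        return {}
    counts = Counter(other for items in baskets for other in items if other != item)
    return {
        other: n / len(baskets)
        for other, n in counts.items()
        if n / len(baskets) >= min_confidence
    }

print(customers_who_bought("milk"))  # ['ann', 'bob', 'dan']
print(items_bought_with("milk"))     # {'bread': 1.0}
```

Note that "frequently" in the mining query only becomes precise once a measure (here, confidence) and a threshold are chosen, which is one reason such queries count as poorly defined.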

1.4 Key Terminology

Data (D): A set of facts (items), usually stored in a database
Pattern (E): An expression in a language L that describes a subset of the facts in D
Attribute: A field in an item i in D
Interestingness (I): A function that maps an expression E in L into a measure space M

The Data Mining Task: Given a dataset D, a language of facts L, an interestingness function I_D,L, and a threshold c, efficiently find the expressions E in L such that I_D,L(E) > c.
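
The task definition can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: D is a toy list of market baskets, L is taken to be all itemsets of at most two items, and I_D,L is taken to be support (the fraction of baskets that contain the itemset). Other choices of language and interestingness function fit the same template.

```python
from itertools import combinations

# D: hypothetical toy dataset, one set of items per transaction.
D = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "cereal", "bread"},
    {"bread", "eggs"},
]

def language(D, max_size=2):
    """L: the candidate expressions, assumed here to be itemsets of up to max_size items."""
    items = sorted(set().union(*D))
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def interestingness(E, D):
    """I_D,L(E): assumed here to be support, the fraction of facts in D that contain E."""
    return sum(1 for t in D if E <= t) / len(D)

def mine(D, c):
    """Return every expression E in L with I_D,L(E) > c, mapped to its score."""
    scored = ((E, interestingness(E, D)) for E in language(D))
    return {E: score for E, score in scored if score > c}

for E, score in mine(D, c=0.5).items():
    print(sorted(E), score)
```

With c = 0.5 this yields {bread} (0.8), {milk} (0.6), and {bread, milk} (0.6). The sketch enumerates every candidate in L, so it covers the "find E such that I_D,L(E) > c" part but not the "efficiently" part; practical algorithms such as Apriori prune the candidate space instead of scoring every expression.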

2. Data Mining Task Primitives

A data mining task is specified as a data mining query, which is defined in terms of task primitives. These primitives let the user communicate interactively with the system during the mining process.