Module 2: Introduction to Data Mining, Data Exploration and Data Pre-processing

1. Introduction to Data Mining

1.1 What is Data Mining?

Definition 1: The efficient discovery of previously unknown, valid, potentially useful, and understandable patterns in large datasets.

Definition 2: The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

1.2 Database Processing vs. Data Mining

Aspect      Database Processing     Data Mining
Query       Well-defined            Poorly defined
Language    SQL                     No precise query language
Output      Subset of database      Not a subset of database
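
The contrast in the table can be illustrated with a minimal sketch. The purchase records, the table name `purchases`, and the question "which items are bought together?" are hypothetical and chosen only for illustration. The first part runs a well-defined SQL query whose result is a subset of the stored rows; the second part derives co-occurrence counts, which are not rows of the database at all.

```python
import sqlite3
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (customer, item) purchase records.
rows = [
    ("alice", "bread"), ("alice", "milk"),
    ("bob", "bread"), ("bob", "milk"), ("bob", "eggs"),
    ("carol", "eggs"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

# Database processing: a well-defined query written in SQL.
# Every returned row already exists in the table (output is a subset).
query_result = conn.execute(
    "SELECT customer, item FROM purchases "
    "WHERE item = 'milk' ORDER BY customer"
).fetchall()

# Data mining style question: "which items tend to be bought together?"
# There is no single precise query for this; here we simply count item
# pairs per customer. The result is a derived summary, not a set of rows.
baskets = {}
for customer, item in rows:
    baskets.setdefault(customer, set()).add(item)

pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(query_result)
print(pair_counts.most_common(1))
```

Pair counting stands in here for a real mining technique; it is only meant to show that the output is a pattern about the data rather than a selection from it.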

1.3 Query Examples Comparison

Database Queries:

Data Mining Queries:

1.4 Key Terminology

The Data Mining Task: Given a dataset D, a language of facts L, an interestingness function I_{D,L}, and a threshold c, efficiently find expressions E in L such that I_{D,L}(E) > c.
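
The definition above can be made concrete with a small sketch. Every specific choice here is an assumption made for illustration: D is a list of transactions, L is the set of all itemsets of size 1 or 2, and the interestingness function is itemset support (the fraction of transactions containing the itemset). The function names `support` and `mine` are hypothetical.

```python
from itertools import combinations

def support(dataset, itemset):
    """Assumed interestingness function I_{D,L}: the fraction of
    transactions in the dataset that contain every item in itemset."""
    hits = sum(1 for transaction in dataset if itemset <= transaction)
    return hits / len(dataset)

def mine(dataset, language, interestingness, c):
    """Return every expression E in the language with I(E) > c.
    This is a brute-force scan over the whole language."""
    return [E for E in language if interestingness(dataset, E) > c]

# D: hypothetical transactions.
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"eggs"},
    {"bread"},
)]

# L: all itemsets of size 1 or 2 over the items that occur in D.
items = sorted(set().union(*D))
L = [frozenset(combo) for k in (1, 2) for combo in combinations(items, k)]

interesting = mine(D, L, support, c=0.4)
for E in interesting:
    print(sorted(E), support(D, E))
```

The sketch checks every expression in L, so it does not meet the "efficiently" requirement of the definition once L grows large; practical algorithms aim to avoid enumerating the full language.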


2. Data Mining Task Primitives

A data mining task is specified as a data mining query, and the query is defined in terms of task primitives. These primitives let the user communicate interactively with the system during the mining process.

2.1 Five Key Task Primitives