Module 06

Introduction

Definition: Web Mining uses data mining techniques to automatically discover and extract information from web documents and services.

Why Mine the Web?

The WWW is an enormous global information service center containing news, advertisements, consumer information, financial management, education, and e-commerce content
It provides an extensive source of data for data mining
Web logs contain sequences of URLs accessed by users
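
As a rough illustration of the last point, the sketch below groups log records into one time-ordered URL sequence per user. The record layout (user id, timestamp, URL), the sample data, and the function name `url_sequences` are assumptions made for this example; real server logs need parsing and session splitting first.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, timestamp_in_seconds, url).
LOG = [
    ("u1", 10, "/home"),
    ("u2", 12, "/home"),
    ("u1", 25, "/products"),
    ("u2", 30, "/about"),
    ("u1", 40, "/cart"),
]


def url_sequences(records):
    """Group log records into one time-ordered URL sequence per user."""
    sequences = defaultdict(list)
    # Sort by timestamp so each user's URLs appear in access order.
    for user, _, url in sorted(records, key=lambda r: r[1]):
        sequences[user].append(url)
    return dict(sequences)


sequences = url_sequences(LOG)
print(sequences)
```

Sequences like these are the usual input for web usage mining, for example when looking for frequent navigation paths.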

Web Mining Issues:

Size: early figures put Google's index at about 3 billion documents; by 2016 Google reported knowing of over 130 trillion pages. Roughly 2.5 exabytes of data are generated daily
Diversity: data comes in many types, including web pages (content), link structures, usage data, and supplemental data

Web Mining Taxonomy:

Crawlers (Robots/Spiders):

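Traverse the web's hypertext structure by following links, collecting information used to construct indexes for search engines
Examples: Amazonbot and Bingbot are production crawlers; Scrapy (crawling framework), Beautiful Soup (HTML parsing), and Selenium (browser automation) are tools commonly used to build crawlers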
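
The traverse-and-index idea can be sketched in a few lines. This is a minimal illustration, not how any production crawler works: the in-memory `PAGES` dictionary stands in for real HTTP fetching, and the names `LinkAndTextParser` and `crawl` are made up for this example. It visits pages breadth-first from a seed URL and builds a simple inverted index (word to set of URLs) using only the Python standard library.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a> web mining',
    "/b": '<a href="/a">a</a> data mining',
    "/c": "web crawler",
}


class LinkAndTextParser(HTMLParser):
    """Collect outgoing links and lowercase words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seed, pages):
    """Breadth-first crawl from seed; return (visit order, inverted index)."""
    index = defaultdict(set)
    visited = []
    seen = {seed}           # URLs already queued, to avoid revisiting
    queue = deque([seed])   # frontier of URLs still to visit
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:    # dead link: nothing to index
            continue
        visited.append(url)
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index[word].add(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, dict(index)


visited, index = crawl("/a", PAGES)
print(visited)
print(sorted(index["mining"]))
```

The `seen` set is what keeps the crawler from looping on cyclic links (here `/a` and `/b` link to each other). Real crawlers also add politeness delays, robots.txt checks, and URL normalization, all omitted here.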

Types of Crawlers: