Introduction
Definition: Web Mining uses data mining techniques to automatically discover and extract information from web documents and services.
Why Mine the Web?
- The WWW is an enormous global information service center containing news, advertisements, consumer information, financial management, education, and e-commerce content
- The Web provides an extensive source of data for data mining
- Web logs contain sequences of URLs accessed by users
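As a rough illustration of the last point, the sketch below groups the URLs in a server log by client host, preserving request order. It assumes the log follows the Common Log Format; the sample lines, the `LOG_LINE` pattern, and the `url_sequences` helper are hypothetical and only stand in for real log data.

```python
import re
from collections import defaultdict

# Assumption: lines follow the Common Log Format. The pattern is a simplified
# illustration, not a complete parser for every server configuration.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" \d{3} \S+'
)

def url_sequences(lines):
    """Group requested URLs by client host, in the order they appear in the log."""
    sequences = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        sequences[match["host"]].append(match["url"])
    return dict(sequences)

# Made-up sample log entries, for illustration only.
sample_log = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:05 +0000] "GET /news.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /products.html HTTP/1.1" 200 2048',
]
sequences = url_sequences(sample_log)
for host, urls in sequences.items():
    print(host, urls)
```

Using the client host as a stand-in for a user is a simplification; real usage mining typically needs session identification on top of this.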
Web Mining Issues:
- Size: Google indexed roughly 3 billion documents in the early 2000s and reported knowing of over 130 trillion pages by 2016; an estimated 2.5 exabytes of data are generated daily
- Diverse types of data including web pages, structures, usage data, and supplemental data
Web Mining Taxonomy:
- Web Content Mining: Search Result Mining
- Web Structure Mining: Hyperlink structure
- Web Usage Mining: General Access Pattern Tracking, Customized Usage Tracking
Web Content Mining
Crawlers (Robots/Spiders):
- Traverse hypertext structure and collect information to construct indexes for search engines
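- Examples: search engine crawlers such as Amazonbot and Bingbot; tools commonly used to build crawlers and scrapers include Scrapy (crawling framework), Beautiful Soup (HTML parsing), and Selenium (browser automation)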
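The traversal described above can be sketched in a few lines. This is a minimal, breadth-first illustration using only the Python standard library, not how any production crawler works. The `TOY_WEB` dictionary is a hypothetical in-memory stand-in for the Web so that the example runs without network access, and `LinkAndTextParser` and `crawl` are names invented for this sketch.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
TOY_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a> home page',
    "http://example.test/a.html": '<a href="/b.html">B</a> data mining notes',
    "http://example.test/b.html": '<a href="/">home</a> web mining notes',
}

class LinkAndTextParser(HTMLParser):
    """Collects hyperlink targets and lowercase words from one page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z]+", data.lower()))

def crawl(seed, fetch, max_pages=100):
    """Follow links breadth-first from seed and build a word -> URLs index."""
    index = defaultdict(set)
    visited = set()
    frontier = deque([seed])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        html = fetch(url)
        if html is None:
            continue  # page not available
        visited.add(url)
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()  # flush any buffered trailing text
        for word in parser.words:
            index[word].add(url)
        for href in parser.links:
            frontier.append(urljoin(url, href))  # resolve relative links
    return visited, dict(index)

visited, index = crawl("http://example.test/", TOY_WEB.get)
print(sorted(index["mining"]))
```

A real crawler would also need politeness rules (for example honoring robots.txt and rate limits), which this sketch leaves out.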
Types of Crawlers:
- Traditional Crawler: Visits the entire Web and replaces the existing index
- Periodic Crawler: Visits portions of the Web and updates a subset of the index
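The difference between the two crawler types comes down to how the index is maintained. The sketch below models the index as a plain dictionary from URL to page content. The function names, the two `snapshot` dictionaries, and the choice of which URLs to refresh are all hypothetical and only serve to contrast full replacement with a partial update.

```python
def traditional_crawl(fetch, all_urls):
    """Visit every known URL and build a brand-new index that replaces the old one."""
    return {url: fetch(url) for url in all_urls}

def periodic_crawl(index, fetch, urls_to_refresh):
    """Revisit only a portion of the URLs and update just those index entries."""
    updated = dict(index)  # entries not revisited are carried over unchanged
    for url in urls_to_refresh:
        updated[url] = fetch(url)
    return updated

# Two made-up snapshots of the same small site at different times.
snapshot_v1 = {"/home": "welcome", "/news": "monday headlines", "/about": "about us"}
snapshot_v2 = {"/home": "welcome", "/news": "tuesday headlines", "/about": "about us, updated"}

# Full crawl of the first snapshot, then a partial refresh against the second.
index = traditional_crawl(snapshot_v1.get, snapshot_v1)
index = periodic_crawl(index, snapshot_v2.get, ["/news"])
print(index)
```

In this sketch the "/about" entry stays stale after the partial refresh, which shows the trade-off: a periodic crawl costs less than a full one but leaves unvisited entries out of date.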