Data Mining: Concepts and Techniques
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data Mining: On What Kind of Data?
Evolution of Sciences
Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, physics, or linguistics).
Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes.
Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Data Mining: Concepts and Techniques, April 12, 2012
1960s: Data collection, database creation, and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
1990s:
2000s:
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data Mining: On What Kind of Data?
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Structure and Network Analysis
Evaluation of Knowledge
Applications of Data Mining
Major Challenges in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
This is a view from typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
[Figure: knowledge discovery process, from Databases through Data Cleaning and Data Integration to the Data Warehouse, then Selection & Transformation to Task-relevant Data, then Data Mining and Pattern Evaluation]
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
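The steps listed above, from data cleaning through knowledge presentation, can be sketched as a chain of small functions. This is only a minimal illustration under stated assumptions: the two toy sources, the field names, the threshold, and the choice of "mean amount per region" as the mining step are all hypothetical and not taken from the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw sources (assumed for illustration only).
store_sales = [
    {"customer": "a", "region": "north", "amount": "120.0"},
    {"customer": "b", "region": "south", "amount": None},  # missing value
    {"customer": "c", "region": "north", "amount": "80.0"},
]
web_sales = [
    {"customer": "d", "region": "south", "amount": "40.0"},
    {"customer": "e", "region": "south", "amount": "60.0"},
    {"customer": "f", "region": "north", "amount": "100.0"},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: convert amounts from text to numbers."""
    return [{**r, "amount": float(r["amount"])} for r in records]

def mine(records):
    """Data mining (toy version): mean amount per region."""
    groups = defaultdict(list)
    for r in records:
        groups[r["region"]].append(r["amount"])
    return {region: mean(values) for region, values in groups.items()}

def evaluate(patterns, min_mean):
    """Pattern evaluation: keep patterns that pass an interestingness threshold."""
    return {k: v for k, v in patterns.items() if v >= min_mean}

def present(patterns):
    """Knowledge presentation: render the surviving patterns as readable lines."""
    return [f"{region}: mean amount {value:.1f}"
            for region, value in sorted(patterns.items())]

data = integrate(clean(store_sales), clean(web_sales))
data = transform(select(data, ["region", "amount"]))
patterns = mine(data)
report = present(evaluate(patterns, min_mean=75.0))
print(report)
```

Each function maps to one step, so a step can be swapped out (for example, a different mining routine) without touching the others.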
[Figure: layered pyramid, top to bottom, with the role labeled at each end]
Decision Making (End User)
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
KDD Process
Input Data -> Data Preprocessing -> Data Mining -> Post-processing
Health care and medical data mining often adopts such a view from statistics and machine learning:
Preprocessing of the data (including feature extraction and dimension reduction)
Classification and/or clustering processes
Post-processing for presentation
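The three stages named above (preprocessing, classification, post-processing) might look like the following in plain Python. This is a sketch under loud assumptions: the patient measurements are made up, the "dimension reduction" is just dropping constant columns, and the nearest-centroid classifier is one simple stand-in, not a method the slides prescribe.

```python
from math import dist

# Hypothetical toy records: (features, label). All values are invented.
# Assumed columns: body temperature, heart rate, and a constant flag.
train = [
    ([36.6, 70.0, 1.0], "healthy"),
    ([36.8, 72.0, 1.0], "healthy"),
    ([39.0, 95.0, 1.0], "ill"),
    ([38.6, 99.0, 1.0], "ill"),
]
new_patients = [[36.7, 71.0, 1.0], [38.9, 96.0, 1.0]]

def fit_preprocessor(rows):
    """Find non-constant columns and record their min and max."""
    cols = list(zip(*rows))
    keep = [i for i, c in enumerate(cols) if max(c) > min(c)]
    return [(i, min(cols[i]), max(cols[i])) for i in keep]

def preprocess(row, params):
    """Drop constant columns, then min-max scale using the training range."""
    return [(row[i] - lo) / (hi - lo) for i, lo, hi in params]

def fit_centroids(rows, labels):
    """Classification model: one mean vector (centroid) per class."""
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    return {lab: [sum(c) / len(c) for c in zip(*rs)]
            for lab, rs in by_label.items()}

def classify(row, centroids):
    """Assign the label of the nearest centroid (Euclidean distance)."""
    return min(centroids, key=lambda lab: dist(row, centroids[lab]))

def postprocess(predictions):
    """Post-processing: turn raw labels into a presentable summary."""
    return [f"patient {i}: {label}"
            for i, label in enumerate(predictions, start=1)]

features = [f for f, _ in train]
labels = [lab for _, lab in train]
params = fit_preprocessor(features)
scaled = [preprocess(f, params) for f in features]
centroids = fit_centroids(scaled, labels)
predictions = [classify(preprocess(p, params), centroids) for p in new_patients]
summary = postprocess(predictions)
print(summary)
```

Note that the scaling parameters are learned from the training rows only and then reused for new rows, which keeps the preprocessing and classification stages consistent.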
[Figure: Data Mining at the center, surrounded by contributing fields: Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]
Knowledge to be mined (or: data mining functions)
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Descriptive vs. predictive data mining
  Multiple/integrated functions and mining at multiple levels
Data to be mined
  Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs, and social and information networks
Techniques utilized
  Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Algorithms must be highly scalable to handle data volumes such as terabytes
High-dimensionality of data
  Micro-array data may have tens of thousands of dimensions
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
Data mining tasks: descriptive and predictive
1. Descriptive task
1.1 Data characterization and discrimination
a. Identifying data
b. Selecting data
1.2 Mining frequent patterns, associations and correlations
a. Frequent itemsets
b. Frequent subsequences
c. Frequent substructures
2. Predictive task
2.1 Data classification and data prediction
a. Decision trees
b. Neural networks
2.2 Cluster evaluation
2.3 Outlier evaluation
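As one concrete instance of the frequent-itemset entry in the list above, the following sketch counts itemsets that occur in at least a minimum number of transactions. The slides do not name an algorithm; this uses a level-wise, Apriori-style search as an assumed example, and the market-basket transactions and the `min_support` value are hypothetical.

```python
from itertools import combinations

# Hypothetical market-basket transactions (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions.

    Level-wise search: size-(k+1) candidates are built from items that occur
    in frequent size-k itemsets, and a candidate is counted only if all of
    its size-k subsets are frequent (the Apriori pruning idea).
    """
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    current = {}
    for item in sorted({i for t in transactions for i in t}):
        single = frozenset([item])
        count = support(single)
        if count >= min_support:
            current[single] = count

    result = dict(current)
    k = 1
    while current:
        frequent_items = sorted({i for s in current for i in s})
        next_level = {}
        for combo in combinations(frequent_items, k + 1):
            # Prune: every size-k subset must already be frequent.
            if all(frozenset(sub) in current for sub in combinations(combo, k)):
                candidate = frozenset(combo)
                count = support(candidate)
                if count >= min_support:
                    next_level[candidate] = count
        result.update(next_level)
        current = next_level
        k += 1
    return result

found = frequent_itemsets(transactions, min_support=3)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With this toy data and threshold, all four single items are frequent but only one pair survives, which shows how the support threshold narrows the pattern space as itemsets grow.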
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)