Seminar5-Week 5-Data Mining and Data Analytics


ACCT 2004 – Business Technologies & Data Management for Accountants:

Week 5 – Data Mining & Data Analytics

Week 5-Semester 1

Curtin University is a trademark of Curtin University of Technology

CRICOS Provider Code 00301J
Copyright Information

▪ Unless specified otherwise, all materials are developed by the UC or
based on the textbooks used in this unit.

Info. of Unit Coordinator
▪ Unit Coordinator is:
▪ Dr June Cao
▪ Room 407.442

▪ Your Lecturer:
▪ June Cao
▪ Best contact is email (as above)

Agenda of This Week

▪ What is data mining

▪ Why data mine
▪ Types of data collected
▪ Storage of data
▪ What can you do with the data
▪ How do we data mine
▪ Examples of data mining

Source: Adapted from PwC 2016, ‘Redefining business success in a changing world’, CEO Survey 2016,
CEO-survey/assets/2016-global-investor-survey.pdf, accessed 6 February 2019.

What Is Data Mining?

▪ No unique definition
• “The field of data mining is still relatively new and in a state of
evolution. The first International Conference on Knowledge Discovery
and Data Mining (KDD) was held in 1995, and there are a variety of
definitions of data mining.” (Shmueli, Patel, and Bruce 2010)
▪ Commonly defined as:
• … the use of efficient techniques for the analysis of very large
collections of data and the extraction of useful and possibly unexpected
patterns in data.

What Is Data Mining?-Cont.

▪ Gartner Group
▪ “Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting through
large amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and
mathematical techniques”
▪ Source:

What Is Data Mining?-Cont.
▪ “Data mining is a diverse set of techniques for discovering
patterns or knowledge in data. This usually starts with a
hypothesis that is given as input to data mining tools that use
statistics to discover patterns in data. Such tools typically
visualize results with an interface for exploring further. ”(Source:
▪ Data mining, the art of extracting useful information from large
amounts of data, is of growing importance in today’s world.


Source: lecture notes of Tan, Steinbach, Karpatne, and Kumar, 2018

What Is Data Mining?-Cont.
▪ “Data mining is the analysis of (often large) observational data
sets to find unsuspected relationships and to summarize the data
in novel ways that are both understandable and useful to the data
analyst” (Hand, Mannila, Smyth, 2010)
▪ “Data mining is the discovery of models for data” (Leskovec,
Rajaraman, Ullman, 2019)
▪ We can have the following types of models:
• Models that explain the data (e.g., a single function)
• Models that predict future data instances
• Models that summarize the data
• Models that extract the most prominent features of the data

Connections of Data Mining with other areas
▪ Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
▪ Traditional techniques may be unsuitable due to
• Enormity of data
• High dimensionality of data
• Heterogeneous, distributed nature of data
• Emphasis on the use of data
[Figure: data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]

Source: P.N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining

Models vs. Analytic Processing
▪ To a database person, data-mining is an extreme form of
analytic processing – queries that examine large amounts
of data.
• Result is the query answer.

▪ To a statistician, data-mining is the inference of models.

• Result is the parameters of the model.

CS345A Data Mining on the Web: Anand Rajaraman, Jeff Ullman

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, pattern recognition, and other disciplines]

Why Data Mining?
▪ Commercial point of view
• Data has become the key competitive advantage of companies
o Examples: Facebook, Google, Amazon
• Being able to extract useful information out of the data is key for
exploiting it commercially.
▪ Scientific point of view
• Scientists are in an unprecedented position where they can collect
terabytes of information
o Examples: Sensor data, astronomy data, social network data,
gene data
• We need the tools to analyze such data to get a better
understanding of the world and advance science
▪ Scale (in data size and feature dimension)
• Why not use traditional analytic methods?
• Enormity of data, curse of dimensionality
• The amount and the complexity of data do not allow for
manual processing of the data. We need automated techniques.
[Figure panels: fMRI data from brain; sky survey data; gene expression data; surface temperature of Earth]

Why Data Mining?-Cont.
▪ Massive amounts of data are now produced!
• In the digital age, terabytes of data are generated every second
• Mobile devices, digital photographs, web documents, etc.
• Facebook updates, Tweets, Blogs, User-generated content
• Transactions, sensor data, surveillance data
• Queries, clicks, browsing
• Cheap storage has made it possible to maintain this data
• Need to analyze the raw data to extract knowledge
• Big data
• “Big data is a relative term-data today are big by reference to the past, and to the methods
and devices available to deal with them. The challenge big data presents is often
characterized by the four V’s-volume, velocity, variety, and veracity. Volume refers to the
amount of data. Velocity refers to the flow rate-the speed at which it is being generated and
changed. Variety refers to the different types of data being generated (currency, dates,
numbers, text, voice, audio etc. ). Veracity refers to the fact that data is being generated by
organic distributed processes (e.g., millions of people signing up for services or free
downloads)” (Shmueli, Patel, and Bruce 2016).

Why Data Mining?-Cont.

▪ The data is also very complex
• Multiple types of data: tables, time series, images, graphs, etc.
• Spatial and temporal aspects
• Interconnected data of different types
o From the mobile phone we can collect: location of the user,
friendship information, check-ins to venues, opinions through
Twitter, images through cameras, queries to search engines
[Figure panels: cyber security; traffic patterns; social networking (Twitter); computational simulations; sensor networks]

Source: lecture notes of Tan, Steinbach, Karpatne, and Kumar, 2018
Why Data Mining?-Cont.
▪ Great opportunities to improve productivity in all walks of life

Source: lecture notes of Tan, Steinbach, Karpatne, and Kumar, 2018

Why Data Mining?-Cont.
▪ Great Opportunities to Solve Society’s Major Problems

• Improving health care and reducing costs
• Predicting the impact of climate change
• Reducing hunger and poverty by increasing agriculture production
• Finding alternative/green energy sources
Source: lecture notes of Tan, Steinbach, Karpatne, and Kumar, 2018

Example: Transaction Data

▪ Billions of real-life customers:

• Walmart: 20M transactions per day
• AT&T: 300M calls per day
• Credit card companies: billions of transactions per day.

▪ Point cards allow companies to collect information about specific users

Example: Document Data

▪ Web as a document repository: an estimated 50 billion web pages
▪ Wikipedia: 4 million articles (and counting)
▪ Online news portals: a steady stream of hundreds of new
articles every day
▪ Twitter: ~300 million tweets every day

Example: Network Data
▪ Web: 50 billion pages linked via hyperlinks
▪ Facebook: 500 million users
▪ Twitter: 300 million users
▪ Instant messenger: ~1 billion users
▪ Blogs: 250 million blogs worldwide,
presidential candidates run blogs

Image source:

Behavioral Data
▪ Mobile phones today record a large amount of information about
user behavior
• GPS records position
• Camera produces images
• Communication via phone and SMS
• Text via Facebook updates
• Association with entities via check-ins
▪ Amazon collects all the items that you browsed, placed into your
basket, read reviews about, or purchased.
▪ Google and Bing record all your browsing activity via toolbar
plugins. They also record the queries you asked, the pages you
saw and the clicks you made.
▪ Data collected for millions of users on a daily basis

Example Data Sources
▪ Genomic sequences project
▪ Climate data (just one example)

How Is the Data Stored?
▪ Data can be stored in databases by various organisations, e.g.:
• CRMs (Customer Relationship Management systems)
• Customer records (Banks, medical, etc.)
• Government records (MyGov, Centrelink, Tax Office)

▪ Other storage such as a data matrix, similar to a single huge table
in a database
▪ Can also be stored as HTML links, Web graphs, item sets,
strings and vectors
▪ And many other types of storage

What Can You Do With The Data?
▪ Suppose that you are the owner of a supermarket and you have
collected billions of market basket records. What information would
you extract from them and how would you use it?

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Chocolate, Milk
4    Beer, Bread, Chocolate, Milk
5    Coke, Chocolate, Milk

▪ Possible uses:
• Product placement
• Catalog creation
• Recommendations

What Can You Do With The Data?-Cont.
▪ Suppose you are a stockbroker and you observe the fluctuations
of multiple stocks over time. What information would you like to
get out of your data?
• Clustering of stocks
• Correlation of stocks
• Stock value prediction
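
As a rough illustration of the "correlation of stocks" question, the sketch below computes a Pearson correlation coefficient between price series. The tickers and prices are made up for the example, and correlating raw prices (rather than returns) is a simplifying assumption.

```python
import math

# Hypothetical daily closing prices; tickers and values are invented.
prices = {
    "AAA": [10, 11, 12, 13, 14],
    "BBB": [20, 22, 24, 26, 28],
    "CCC": [30, 29, 27, 26, 25],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

corr_ab = pearson(prices["AAA"], prices["BBB"])  # series that rise together
corr_ac = pearson(prices["AAA"], prices["CCC"])  # series that move in opposite directions
print(round(corr_ab, 3), round(corr_ac, 3))
```

A value near +1 suggests two stocks move together, a value near -1 suggests they move in opposite directions, and a value near 0 suggests no linear relationship.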

What Can You Do With The Data?-Cont.
▪ Suppose you are the owner of a social network and you have full
access to the social graphs. What kind of information do you
want to get out of your graphs?
• Who is the most important node in the graph?
• What is the shortest path between two nodes?
• How many friends do two nodes have in common?
• How does information spread on the network?

Source: Cao 2020
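
Three of the questions above can be sketched on a tiny friendship graph. The names and friendships are hypothetical, and "most important" is approximated here by the number of friends (degree), which is only one of several possible importance measures.

```python
from collections import deque

# Hypothetical undirected friendship graph: person -> set of friends.
friends = {
    "Ann": {"Bob", "Cat"},
    "Bob": {"Ann", "Cat", "Dan"},
    "Cat": {"Ann", "Bob", "Eve"},
    "Dan": {"Bob"},
    "Eve": {"Cat"},
}

def most_connected(graph):
    """Node with the most friends (ties broken alphabetically)."""
    return max(sorted(graph), key=lambda node: len(graph[node]))

def common_friends(graph, a, b):
    """Friends shared by a and b."""
    return graph[a] & graph[b]

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

hub = most_connected(friends)
shared = common_friends(friends, "Ann", "Dan")
path = shortest_path(friends, "Dan", "Eve")
print(hub, shared, path)
```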

What Can We Do With the Data?-Cont.
▪ Artificial Intelligence
• “Machine learning is often based on data mining. An artificial intelligence might
develop theories about its problem space and then use data mining to build
confidence in the theory. For example, a self-driving car that observes a white van
drive by at twice the speed limit might develop the theory that all white vans drive
fast. The AI can then use a data mining technique to determine if the theory is worth
maintaining. ”

Source: Google, Images

What Can We Do With the Data?-Cont.
▪ Marketing
▪ A product development group is designing a package for a pair
of running shoes. A designer has a theory that men’s shoes with
pink packaging tend to sell better. The team uses a data mining
tool to see if this idea has any historical support.
• Pricing
• Target markets
• Branding
• Opportunities
• Target customers
• Locations
• Competitors
• Benchmarks
• Risks
Source: Google, Images
What Can We Do With the Data?-Cont.
▪ Research
▪ A researcher has a well-developed theory that air pollution is
associated with higher incidence of dementia. The researcher
builds a database with US air quality readings and the addresses
of dementia patients for several states. They use a data mining
tool to explore the association.
▪ Details:

What Can We Do With the Data?-Cont.
▪ Farming
▪ A farmer develops a theory that standard recommendations for
the amount of water required by tomato plants are excessive. She
uses a data mining tool to explore tomato yield and irrigation
An interesting article: Does AI
Hold the Key to a New and
Improved “Green Revolution” in

Source: Google, Images

What Can We Do With the Data?-Cont.
▪ Some Examples:
• Frequent item sets and Association Rules extraction
• Coverage
• Clustering
• Classification
• Ranking
• Exploratory analysis
Source: Cao, 2020

Frequent Itemsets and Association Rules
▪ Given a set of records, each of which contains some number of
items from a given collection:
• Identify sets of items (itemsets) that frequently occur together
• Produce dependency rules that predict the occurrence of an item based on
occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Itemsets discovered: {Milk, Coke}, {Diaper, Milk}
Rules discovered: {Milk} --> {Coke}, {Diaper, Milk} --> {Beer}

Source: Tan, P.N., Steinbach, M. and Kumar, V.,
2016. Introduction to data mining. Pearson Education India.
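
The five transactions above are small enough to check by hand or with a few lines of code. The sketch below counts how often itemsets occur and computes the confidence of a rule (the share of transactions containing the left-hand side that also contain the right-hand side). It enumerates every pair by brute force, which is an assumption made for brevity and not how large-scale tools work; the minimum count of 3 is also an illustrative choice.

```python
from itertools import combinations

# The five market-basket transactions from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions)

def frequent_itemsets(size, min_count):
    """All itemsets of the given size appearing in at least min_count transactions."""
    all_items = sorted(set().union(*transactions))
    return [c for c in combinations(all_items, size) if count(c) >= min_count]

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)

frequent_pairs = frequent_itemsets(size=2, min_count=3)
conf_milk_coke = confidence({"Milk"}, {"Coke"})
conf_diaper_milk_beer = confidence({"Diaper", "Milk"}, {"Beer"})
print(frequent_pairs, conf_milk_coke, conf_diaper_milk_beer)
```

With a minimum count of 3, the only frequent pairs are the two itemsets listed on the slide. The rule {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 that contain both Diaper and Milk.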

Frequent Itemsets: Applications
▪ Text mining: finding associated phrases in text
There are lots of documents that contain the phrases “association rules”,
“data mining” and “efficient algorithm”

▪ Recommendations:
Users who buy this item often buy this item as well
Users who watched James Bond movies, also watched Jason Bourne

Recommendations make use of item and user similarity

Association Rule Discovery: Application
▪ Supermarket shelf management.
• Goal: To identify items that are bought together by
sufficiently many customers.
• Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
• A classic rule:
o If a customer buys diapers and milk, then he is very likely to buy beer.
o So, don’t be surprised if you find six-packs stacked next to diapers!

Source: lecture notes of Tan, Steinbach, Karpatne, and Kumar, 2018

Clustering Definition
▪ Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
▪ Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
▪ Intra-cluster distances are minimized; inter-cluster
distances are maximized
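
One common way to find such clusters is k-means, sketched below. The slides do not name a specific algorithm, so k-means, the sample points, and the starting centroids are all assumptions for illustration. The similarity measure here is Euclidean distance (`math.dist`, available in Python 3.8+).

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points, and repeat."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D data points with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)
```

Points within each resulting group are close to one another, and the two groups are far apart, which is the property the definition asks for.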


Source: lecture notes of Tan, Steinbach, Karpatne, and Kumar, 2018

Clustering: Application
▪ Applications of Cluster Analysis
• Understanding
o Group related documents for browsing, group genes and proteins that have similar
functionality, or group stocks with similar price fluctuations
• Summarization
o Reduce the size of large data sets
[Figure: clustering precipitation in Australia]

▪ Document Clustering:
• Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them.
• Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the frequencies of
different terms. Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to relate a new
document or search term to clustered documents.
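
The "similarity measure based on the frequencies of different terms" can be sketched as cosine similarity between term-count vectors. Cosine similarity is one common choice, not necessarily the one the lecture has in mind, and the three example documents are invented.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased word occurs in the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0 to 1)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents.
docs = [
    "data mining finds patterns in data",
    "association rules in data mining",
    "tomato plants need water",
]
vectors = [term_frequencies(d) for d in docs]

sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(round(sim_related, 3), sim_unrelated)
```

A clustering method can then group documents whose pairwise similarity is high; here the two data mining documents score well above the unrelated one.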

Coverage
▪ Given a set of customers and items and the transaction
relationship between the two, select a small set of items that
“covers” all users.
• For each user there is at least one item in the set that the user has bought.

▪ Application:
• Create a catalog to send out that has at least one item of interest for every customer.
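
A simple way to pick such a set is the greedy heuristic sketched below: repeatedly choose the item bought by the most customers who are not yet covered. This is an assumed approach for illustration; it gives a small set but is not guaranteed to give the smallest possible one. The purchase data reuses the five baskets from the association rules slide, treating each transaction as one customer.

```python
def greedy_cover(purchases):
    """Pick items until every customer has bought at least one chosen item.

    purchases maps each customer to the set of items they bought.
    """
    uncovered = set(purchases)
    chosen = []
    while uncovered:
        # Count, per item, how many still-uncovered customers bought it.
        counts = {}
        for customer in uncovered:
            for item in purchases[customer]:
                counts[item] = counts.get(item, 0) + 1
        if not counts:  # remaining customers bought nothing at all
            break
        best = max(sorted(counts), key=counts.get)  # ties: alphabetical
        chosen.append(best)
        uncovered = {c for c in uncovered if best not in purchases[c]}
    return chosen

purchases = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}
catalog = greedy_cover(purchases)
print(catalog)
```

Milk covers four of the five customers, and one more item is enough to reach the remaining customer, so a two-item catalog covers everyone.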

Classification: Definition
▪ Given a collection of records (training set)
• Each record contains a set of attributes; one of the attributes is the class.
▪ Find a model for the class attribute as a function of the values of
other attributes.

▪ Goal: previously unseen records should be assigned a class as
accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data
set is divided into training and test sets, with training set used to build the model
and test set used to validate it.

Examples of Classification

Source: lecture notes of Tan, Steinbach, Karpatne, and Kumar, 2018

Examples of Classification-Cont.

Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the training set is used to learn a classifier (the model), which is then applied to the test set]

Source: P.N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
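
The idea of building a model on a training set and checking it on a test set can be sketched with the ten labeled records above. Because the six test records have unknown classes, this sketch instead holds out the last three labeled records as a test set, which is an assumption made so accuracy can be measured. The "model" is a deliberately simple one-attribute rule (predict the most common class for each marital status), not the method used in the textbook.

```python
from collections import Counter, defaultdict

FIELDS = ("refund", "marital", "income_k", "cheat")
ROWS = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
records = [dict(zip(FIELDS, row)) for row in ROWS]

def train_one_rule(train, attribute, target="cheat"):
    """For each value of the attribute, remember the most common class."""
    by_value = defaultdict(Counter)
    for r in train:
        by_value[r[attribute]][r[target]] += 1
    rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    default = Counter(r[target] for r in train).most_common(1)[0][0]
    return rules, default

def predict(model, record, attribute):
    rules, default = model
    return rules.get(record[attribute], default)

def accuracy(model, test, attribute, target="cheat"):
    """Share of test records whose predicted class matches the true class."""
    hits = sum(predict(model, r, attribute) == r[target] for r in test)
    return hits / len(test)

# Hold-out split: first seven records train the model, last three test it.
train, test = records[:7], records[7:]
model = train_one_rule(train, "marital")
test_accuracy = accuracy(model, test, "marital")
print(f"accuracy on held-out records: {test_accuracy:.2f}")
```

With so few records the accuracy estimate is unreliable; the point is only that the model is judged on records it did not see during training.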

Examples of Classification-Cont.
▪ Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
o Use credit card transactions and the information on the account-holder as
attributes.
▪ When does a customer buy, what does he buy, how often he pays
on time, etc.
o Label past transactions as fraud or fair transactions. This forms the class
attribute.
o Learn a model for the class of the transactions.
o Use this model to detect fraud by observing credit card transactions on an
account.

Source: P.N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining

Examples of Classification-Cont.
▪ Fraud Detection
• Business problem: Fraud increases costs or reduces revenues
• Solution: Use logistic regression, neural nets to identify characteristics of fraudulent
cases to prevent in future or prosecute more vigorously
• Benefits: Increased profits by reducing undesirable customers

▪ Automobile Insurance Bureau of Massachusetts

• Past reports on claims adjustors scrutinized by experts to identify cases of fraud
• Several characteristics (over 60) of claimant, type of accident, type of
injury/treatment coded into database
• Dimension reduction methods used to obtain weighted variables. Multiple regression
step-wise subset selection methods used to identify characteristics strongly correlated
with fraud

Examples of Classification-Cont.
▪ Risk Analysis
• Business problem: Reduce risk of loans to delinquent customers
• Solution: Use credit scoring methods using discriminant analysis to create score
functions that separate out risky customers
• Benefits: Decrease in cost of bad debts

▪ Clicks to Customers
• Business problem: 50% of Dell’s clients order their computer through the web.
However, the retention rate is 0.5%, i.e., of visitors of Dell’s web page become
• Solution: Through the sequence of their clicks, cluster customers and design
website, interventions to maximize the number of customers who eventually buy.
• Benefits: Increase revenues.

Examples of Classification-Cont.
▪ Recommendation Systems
• Business Opportunities: Users rate items (,,
on the web. How to use information from other users to infer ratings for a particular user?
• Solution: Use of a technique known as collaborative filtering
• Benefits: Increase revenues by cross selling, up selling
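
A minimal sketch of user-based collaborative filtering follows: a missing rating is estimated as a similarity-weighted average of the ratings other users gave the same item. The users, items, ratings, and the choice of cosine similarity over shared items are all assumptions for illustration.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = ratings[u].keys() & ratings[v].keys()
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict_rating(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = total_weight = 0.0
    for other in ratings:
        if other != user and item in ratings[other]:
            weight = similarity(user, other)
            weighted_sum += weight * ratings[other][item]
            total_weight += weight
    return weighted_sum / total_weight if total_weight else None

predicted = predict_rating("alice", "D")
print(round(predicted, 2))
```

Alice's tastes match Bob's more closely than Carol's, so the estimate for item D lands nearer Bob's rating of 5 than Carol's rating of 2. Items with high predicted ratings are natural candidates for cross-selling.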

▪ Emerging Major Data Mining Applications

• Spam
• Bioinformatics/Genomics
• Medical History Data-Insurance Claims
• Personalization of services in e-commerce
• RF Tags: Gillette
• Security: Container Shipments; Network Intrusion Detection

Data Mining Cycle for Business

References
▪ Hand, D.J., Mannila, H. and Smyth, P. (2001) Principles of Data Mining. MIT Press.
▪ Witten, I.H. and Frank, E. (2000) Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations. Morgan Kaufmann.
▪ Hand, D.J., Kelly, M.J., Blunt, G. and Adams, N.M. (2000) Data mining for
fun and profit. Statistical Science, 15(2), 111–126.
▪ Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R. and Lichtendahl Jr, K.C.
(2017) Data Mining for Business Analytics: Concepts, Techniques, and
Applications in R. John Wiley & Sons.
▪ Leskovec, J., Rajaraman, A. and Ullman, J.D. (2019) Mining of Massive Data
Sets. Cambridge University Press.
▪ Tan, P.N., Steinbach, M. and Kumar, V. (2016) Introduction to Data Mining.
Pearson Education India.
I hope you enjoyed your lecture!