
Foundations of Data Science - Unit 3


Acknowledgement
▪ Most of the slides in this presentation are taken from material provided by
▪ Han and Kamber (Data Mining Concepts and Techniques) and
▪ Tan, Steinbach and Kumar (Introduction to Data Mining)

Zarmeen Nasim, Spring 2021
Announcements
▪ CS students are not allowed to take this course!

From Business Problems
to Data Mining Tasks
Classification: Definition
▪ Given a collection of records (training set)
▪ Each record contains a set of attributes; one of the attributes is the class.

▪ Find a model for the class attribute as a function of the values of the other attributes.
▪ Goal: previously unseen records should be assigned a class as accurately as possible.
▪ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into
training and test sets, with the training set used to build the model and the test set used to validate it.
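
The train/test procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the helper names (`train_test_split`, `majority_class`, `accuracy`), the 30% test fraction, and the toy records are all assumptions made for the example. The "model" here is only a baseline that always predicts the most common class in the training set.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split them into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def majority_class(train):
    """Baseline 'model': the most common class label in the training set."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

def accuracy(predictions, actual):
    """Fraction of records whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)

# Hypothetical toy records: (attributes, class label)
records = [({"income": i * 10}, "Yes" if i % 3 == 0 else "No")
           for i in range(1, 11)]

train, test = train_test_split(records)       # build on train, validate on test
model = majority_class(train)
predictions = [model for _ in test]
test_accuracy = accuracy(predictions, [label for _, label in test])
print(len(train), len(test), model, test_accuracy)
```

The accuracy is measured only on the held-out test records, which the model never saw while it was being built.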

Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model; the Model is then applied to the Test Set.
Classification: Application 1
▪ Direct Marketing
▪ Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a
new cell-phone product.
▪ Approach:
▪ Use the data for a similar product introduced before.
▪ We know which customers decided to buy and which decided otherwise. This
{buy, don’t buy} decision forms the class attribute.
▪ Collect various demographic, lifestyle, and company-interaction related
information about all such customers.
▪ Use this information as input attributes to learn a classifier model.

Classification: Application 2
▪ Fraud Detection
▪ Goal: Predict fraudulent cases in credit card transactions.
▪ Approach:
▪ Use credit card transactions and the information on its account-holder as
attributes.
▪ When the customer buys, what the customer buys, how often the customer pays on time, etc.
▪ Label past transactions as fraud or fair transactions. This forms the class
attribute.
▪ Learn a model for the class of the transactions.
▪ Use this model to detect fraud by observing credit card transactions on an
account.
Classification: Application 3
▪ Customer Attrition/Churn:
▪ Goal: To predict whether a customer is likely to be lost to a
competitor.
▪ Approach:
▪ Use detailed records of transactions with each of the past and
present customers to find attributes.
▪ How often the customer calls, where the customer calls, what time of
day the customer calls most, financial status, marital status, etc.
▪ Label the customers as loyal or disloyal.
▪ Find a model for loyalty.

Regression
▪ Predict the value of a given continuous-valued variable based on the values of other variables,
assuming a linear or nonlinear model of dependency.
▪ Applications:
▪ Predicting sales amounts of a new product based on advertising expenditure.
▪ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
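
For the linear case, a minimal sketch of fitting a straight line by ordinary least squares is shown below. The function name `fit_line` and the advertising/sales numbers are hypothetical and chosen so the arithmetic is easy to follow (the data lie exactly on sales = 2 * advertising + 1).

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6.0 + intercept   # prediction for a new expenditure of 6.0
print(slope, intercept, predicted_sales)    # 2.0 1.0 13.0
```

Unlike classification, the predicted quantity is a number on a continuous scale, not a class label.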

Clustering Definition
▪ Given a set of data points, each having a set of attributes, and a similarity
measure among them, find clusters such that
▪ Data points in one cluster are more similar to one another.
▪ Data points in separate clusters are less similar to one another.

▪ Similarity Measures:
▪ Euclidean Distance if attributes are continuous.
▪ Other Problem-specific Measures.
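
To make the definition concrete, the sketch below computes Euclidean distance and uses it in a very small k-means loop. Note that k-means is a technique swapped in for illustration; the slides do not name a specific clustering algorithm. The function names, the fixed starting centroids, and the six 2-D points are all assumptions.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centroids, iterations=10):
    """Tiny k-means sketch: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster (keep the old one if the cluster is empty)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical 2-D points forming two visibly separate groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

Points that end up in the same cluster are close to each other under the chosen similarity measure, while points in different clusters are far apart.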

Illustrating Clustering
▪ Euclidean distance based clustering in 3-D space.
▪ Intracluster distances are minimized.
▪ Intercluster distances are maximized.


Clustering: Application 1
▪ Market Segmentation:
▪ Goal: subdivide a market into distinct subsets of customers where any subset
may conceivably be selected as a market target to be reached with a distinct
marketing mix.
▪ Approach:
▪ Collect different attributes of customers based on their geographical and
lifestyle related information.
▪ Find clusters of similar customers.
▪ Measure the clustering quality by observing buying patterns of customers in
the same cluster vs. those from different clusters.

Association Rule Discovery: Definition
▪ Given a set of records, each of which contains some number of items from a given collection;
▪ Produce dependency rules which will predict occurrence of an item based on occurrences of
other items.

TID  Items
1    Bread, Coke, Milk
2    Milk, Bread
3    Coke, Diaper, Milk
4    Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rule Discovered:
{Milk} --> {Coke}
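
One common way to judge how strong such a rule is uses two measures, support and confidence. These measures are not defined on the slide and are introduced here as an assumption; the function names are likewise illustrative. The sketch evaluates the rule on the five transactions listed above.

```python
# The five transactions from the slide, one set of items per TID
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Milk", "Bread"},
    {"Coke", "Diaper", "Milk"},
    {"Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"Milk", "Coke"})            # 3 of 5 transactions
rule_confidence = confidence({"Milk"}, {"Coke"})    # Milk appears in all 5
reverse_confidence = confidence({"Coke"}, {"Milk"}) # every Coke basket has Milk
print(rule_support, rule_confidence, reverse_confidence)  # 0.6 0.6 1.0
```

On this data the reverse rule {Coke} --> {Milk} has higher confidence than {Milk} --> {Coke}, which shows that the direction of a rule matters.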

Association Rule Discovery: Application 1
▪ Marketing and Sales Promotion:
▪ Let the rule discovered be
{Bagels, … } --> {Potato Chips}
▪ Potato Chips as consequent => Can be used to determine what should be done
to boost its sales.
▪ Bagels in the antecedent => Can be used to see which products would be
affected if the store discontinues selling bagels.
▪ Bagels in antecedent and Potato chips in consequent => Can be used to see
what products should be sold with Bagels to promote sale of Potato chips!

Association Rule Discovery: Application 2
▪ Supermarket shelf management.
▪ Goal: To identify items that are bought together by sufficiently
many customers.
▪ Approach: Process the point-of-sale data collected with barcode
scanners to find dependencies among items.
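
A minimal sketch of this approach counts how often each pair of items appears in the same basket and keeps the pairs bought together by at least a chosen number of customers. The function name `frequent_pairs`, the `min_count` threshold, and the baskets are hypothetical stand-ins for real point-of-sale data.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count):
    """Return item pairs that occur together in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical point-of-sale baskets
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Milk"],
    ["Coke", "Diaper", "Milk"],
    ["Bread", "Coke", "Milk"],
]

pairs = frequent_pairs(baskets, min_count=3)
print(pairs)  # {('Bread', 'Milk'): 3}
```

Pairs that pass the threshold are candidates for being placed near each other on the shelf.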

Analytics Example

Grouping items by similarity                                        Clustering
Discovering relationships between items                             Association rules
Determining relationship between outcome and the input variables    Regression
Assigning label/class to records                                    Classification

KNIME Demo
