-- the selected data sets,

I have selected the Bank Marketing dataset from the UCI Machine Learning Repository. It comprises multiple files containing clients' personal information and bank details [Moro et al., 2011].

-- Please include this citation if you plan to use this database:

[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

Datasets are available at: [pdf] http://hdl.handle.net/1822/14838 [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt

Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012. The full dataset was described and analyzed in the reference above.
-- exploration and preparation of the data sets,
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed.
1. bank-full.csv contains all examples, ordered by date (from May 2008 to November 2010).
2. bank.csv contains 10% of the examples (4521), randomly selected from bank-full.csv.
3. The smaller dataset is provided for testing more computationally demanding machine learning algorithms (e.g. SVM).
4. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
5. Number of instances: 45211 for bank-full.csv (4521 for bank.csv).
6. Number of attributes: 16 + output attribute.
7. Attribute information: for more information, read [Moro et al., 2011].

First, I will preprocess the data for machine learning. I will check whether there are null values in the datasets; if there are, I will drop the rows with missing data using dropna. After that, I will check whether all columns are numeric, convert the remaining categorical columns to integer codes, and standardise the data so that it is ready for analysis.
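A minimal sketch of this preprocessing, assuming pandas and scikit-learn and the semicolon-separated bank-full.csv file from UCI (the column handling here is an illustration, not the final pipeline):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("bank-full.csv", sep=";")

    # Check for missing values per column; drop rows with nulls if any exist.
    print(df.isnull().sum())
    df = df.dropna()

    # Encode categorical columns as integer codes so every column is numeric.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes

    # Standardise the features (zero mean, unit variance) for later analysis.
    X = df.drop(columns=["y"])
    X_scaled = StandardScaler().fit_transform(X)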

-- intended use of the data sets


# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
3 - marital : marital status (categorical: "married","divorced","single"; note:
"divorced" means divorced or widowed)
4 - education (categorical: "unknown","secondary","primary","tertiary")
5 - default: has credit in default? (binary: "yes","no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes","no")
8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical:
"unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ...,
"nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for
this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted
from a previous campaign (numeric, -1 means client was not previously
contacted)
15 - previous: number of contacts performed before this campaign and for this
client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical:
"unknown","other","failure","success")
Output variable (desired target):
17 - y - has the client subscribed a term deposit? (binary: "yes","no")
8. Missing Attribute Values: None

Logistic regression model:


For another example, suppose that you work for a financial institution. The risk
assessment department is tasked with evaluating whether a potential client is
likely to default on their credit card repayments, so that the institution can make an informed decision about granting credit to customers. To conduct this
evaluation, the logistic regression model can be used. By training the model, it
can predict the probability of a client defaulting on their credit card repayments
based on historical data of customers with similar characteristics. These
characteristics can include job, marital, age, education, etc.
The scatterplot displays the annual incomes and monthly credit card
balances of many individuals. The individuals who have defaulted on their
credit card repayments are shown in red, whereas the individuals who have not
defaulted are indicated in blue. From the scatterplot, we can deduce that individuals with a higher credit card balance are more likely to default. However, to get a more definitive indication, box plots can be used to better understand the relationship between each independent variable and default status. The first box plot compares the balances of defaulting and non-
defaulting individuals. It shows that the balance of defaulting individuals,
shown in red, is much higher than that of the non-defaulting individuals,
indicated in blue. The second box plot maps the relationship between the
income of defaulting and non-defaulting individuals. It appears that there is no
real difference between defaulting and non-defaulting individuals in relation to
their annual income.
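As a hedged sketch of this credit-default example, the snippet below fits a logistic regression on synthetic stand-in data (the income/balance numbers are invented for illustration, not taken from any real dataset):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    income = rng.uniform(20_000, 80_000, 500)   # hypothetical annual incomes
    balance = rng.uniform(0, 2_500, 500)        # hypothetical card balances
    # Make default depend mainly on balance, as the box plots suggest.
    p = 1 / (1 + np.exp(-(balance - 1_500) / 300))
    default = rng.random(500) < p

    X = np.column_stack([income, balance])
    model = LogisticRegression(max_iter=1000).fit(X, default)

    # Predicted probability of default for a new client.
    print(model.predict_proba([[40_000, 1_800]])[0, 1])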

(Example)
Looking at the scatterplot above, we can visualise the task of classification as essentially seeking a line or a curve to split the space of the feature variables income and balance, so that the model can suggest a separation between the orange and the blue points. This separation can be used to classify a future point for which it is unknown whether it is orange or blue. Note that there is overlap between the areas of the orange and blue points, implying that there will be some error in this separation; in other words, some blue points will be classified as orange and vice versa. The line or curve that we seek to separate the feature space is typically a function of the features and is called a discriminant function.
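For logistic regression in particular, this discriminant function is linear in the features. Writing it out for the income/balance example (the weights w1, w2 and the bias b are generic symbols, learned from the data during training):

    f(income, balance) = w1*income + w2*balance + b

The decision boundary is the line f = 0: points with f > 0 are assigned to one class, points with f < 0 to the other, and the logistic function sigma(f) = 1 / (1 + e^(-f)) converts f into the predicted probability of default.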
-- proposed plans for the modelling exercises
Based on my observations, I decided on the following three tasks, which can be implemented on one or more of the chosen real-world datasets.
1. Unsupervised Learning: where the problem consists of identifying homogeneous population groups or applying dimension reduction techniques, which can then be used in the context of the empirical application.
2. Regression: where the problem consists of continuous target variable(s).
3. Classification: where the problem consists of categorical target variable(s).

In Unsupervised Learning, I decided to use unlabelled training data, i.e., data samples that have no corresponding target variable. The goal of unsupervised learning is to discover underlying structures or patterns in the data. Common unsupervised learning tasks include clustering (dividing similar data samples into different groups) and dimensionality reduction (reducing the dimensionality of the data for visualisation or for computational efficiency). Typical unsupervised learning algorithms include K-means clustering, hierarchical clustering, and principal component analysis (PCA).
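As a minimal sketch of one of the algorithms named above, the snippet below runs hierarchical (agglomerative) clustering; the random data and the choice of 3 clusters are stand-in assumptions:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    X_scaled = np.random.default_rng(0).normal(size=(200, 5))  # stand-in data
    labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)
    print(np.bincount(labels))  # size of each discovered group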

In Regression, I plan to use the column (xxx) as the target; the exact column is still undecided.

Linear Regression: This is one of the simplest and most commonly used
regression algorithms. It assumes a linear relationship between the dependent
and independent variables.

Lasso and Ridge regression are both regularization techniques used in linear
regression to prevent overfitting and improve the model's generalization
performance.
Ridge Regression (L2 Regularization): Ridge regression is a regularized linear
regression algorithm that adds a penalty term to the linear regression objective
function to prevent overfitting.
Lasso Regression (L1 Regularization): Similar to ridge regression, lasso
regression is a regularized linear regression algorithm that uses the absolute
values of the coefficients as a penalty term. It can be useful for feature
selection.
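A hedged sketch of both penalties on a generic regression problem (the synthetic data and the alpha values are placeholders, not final modelling choices):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=300, n_features=10, noise=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)  # L2 penalty shrinks coefficients
    lasso = Lasso(alpha=1.0).fit(X_tr, y_tr)  # L1 penalty can zero them out
    print("ridge R^2:", ridge.score(X_te, y_te))
    print("lasso R^2:", lasso.score(X_te, y_te))
    print("lasso nonzero coefficients:", (lasso.coef_ != 0).sum())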

In Classification, I plan to use logistic regression and random forest and compare their performance. If we have covered neural networks before the coursework deadline, I will also try to apply a neural network.
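A sketch of the planned comparison on the bank data, reusing the integer-encoding idea from the preprocessing step (the encoding and split here are illustrative assumptions):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("bank-full.csv", sep=";")
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes

    X, y = df.drop(columns=["y"]), df["y"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
        clf.fit(X_tr, y_tr)
        print(type(clf).__name__, "accuracy:", clf.score(X_te, y_te))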

-- any other
Because the main focus of this coursework is unsupervised learning, I will focus on PCA (Principal Component Analysis). This can be done by importing StandardScaler and PCA from the scikit-learn Python library.

After that, I will obtain the principal components and the explained variance ratio from the fitted PCA. I will create a graph with the principal components on the x-axis and the explained variance ratio on the y-axis. After that, I will create a scatter plot of the first two PCA components; the pattern in this scatter plot should indicate whether the data separates into visible groups.
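A sketch of this PCA workflow (matplotlib assumed; the random matrix stands in for the encoded bank features):

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.default_rng(0).normal(size=(500, 16))  # stand-in features
    X_scaled = StandardScaler().fit_transform(X)

    pca = PCA().fit(X_scaled)
    ratios = pca.explained_variance_ratio_

    # Explained variance ratio per principal component.
    plt.bar(range(1, len(ratios) + 1), ratios)
    plt.xlabel("Principal component")
    plt.ylabel("Explained variance ratio")
    plt.show()

    # Scatter plot of the first two components.
    Z = pca.transform(X_scaled)[:, :2]
    plt.scatter(Z[:, 0], Z[:, 1], s=5)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()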

I will also conduct K-means clustering, another common unsupervised learning algorithm (note that KNN, K-Nearest Neighbours, is a supervised method; K-means is the clustering algorithm intended here). I will first plot the data to see the rough distribution of the variables. After that, to decide how many clusters to use, I will import KMeans from sklearn.cluster, loop over candidate values of k, and check for which k the sum of squared distances becomes low.
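The loop described above might look like this (stand-in data; inertia_ is scikit-learn's name for the sum of squared distances to the nearest centroid):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(300, 4))  # stand-in features
    inertias = {}
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias[k] = km.inertia_
    print(inertias)  # look for the "elbow" where the decrease flattens out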

As we learned from the subject guide and in class, I will also use the elbow method for finding the optimal k. There are many ways to determine the k value; common methods include:
Elbow method: observe the relationship between the sum of squared errors (SSE) and k for different values of k, and select the inflection point (the elbow) as the k value. This can be done by plotting k on the x-axis against the sum of squared distances on the y-axis.
Silhouette coefficient: calculate the silhouette coefficient for different k values and select the k corresponding to the largest coefficient.
Cross-validation: divide the dataset into a training set and a test set, select the best k by clustering the training set, and calculate the corresponding evaluation metrics on the test set.
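A sketch of the silhouette-based selection (same stand-in data idea as above; the silhouette score requires at least two clusters):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.default_rng(0).normal(size=(300, 4))  # stand-in features
    scores = {}
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    print("best k by silhouette:", max(scores, key=scores.get))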
An open question: I want to apply feature selection; how can I apply it in this case?
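One possible approach, tying in the lasso idea from the regression section (this is an illustrative assumption, not a prescribed method): fit an L1-penalised model and keep only the features with nonzero coefficients, e.g. via scikit-learn's SelectFromModel:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=16, random_state=0)
    selector = SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    ).fit(X, y)
    print("selected feature mask:", selector.get_support())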
