03 Machine Learning Lab Guide-Student Version

Artificial Intelligence Technology and Application
Machine Learning
Lab Guide
Student Version
Huawei Technologies CO., LTD.

Contents
1 Feature Engineering on Banks' Private Credit Data
1.1 Introduction
1.1.1 About This Lab
1.1.2 Objectives
1.1.3 Case Background
1.2 Data Preprocessing
1.2.1 Background
1.2.2 Procedure
1.3 Feature Selection
1.3.1 Background
1.3.2 Feature Selection Methods
1.3.3 Filter
1.3.4 Wrapper
1.3.5 Embedded
1.3.6 Variable Removal
1.4 Feature Construction
1.4.1 Background
1.4.2 Polynomial Feature Construction
2 Real-Time Recommendation Practice for Retail Products
2.1 Introduction
2.1.1 About This Lab
2.2 Procedure
2.2.1 Preparing E-commerce Platform Data
2.2.2 Recommending Products Based on the Product Popularity
2.2.3 Recommending Products Based on Collaborative Filtering
3 Private Credit Default Prediction
3.1 Introduction
3.1.1 Objectives
3.1.2 Background
3.2 Procedure
3.2.1 Reading Data
3.2.2 Viewing Missing Values
3.2.3 Splitting the Dataset
3.2.4 Standardizing Data (Preprocessing Data)
3.2.5 Handling the Class Imbalance Issue (Preprocessing)
3.2.6 Performing Grid Search (Modeling)
3.2.7 Verifying Performance (Evaluation)
3.2.8 Saving the Model
4 Survival Prediction of the Titanic
4.1 Introduction
4.1.1 About This Lab
4.1.2 Objectives
4.1.3 Datasets and Frameworks
4.2 Procedure
4.2.1 Importing Related Libraries
4.2.2 Importing Datasets
4.2.3 Preprocessing Data
4.2.4 Building a Model
5 Linear Regression
5.1 Introduction
5.1.1 About This Lab
5.1.2 Objectives
5.2 Procedure
5.2.1 Preparing Data
5.2.2 Defining Related Functions
5.2.3 Starting Iteration
5.3 Thinking and Practices
5.3.1 Question 1
5.3.2 Question 2
6 Flower Category Analysis
6.1 Introduction
6.1.1 About This Lab
6.2 Experiment Code
6.2.1 Importing Related Libraries
6.2.2 Importing a Dataset
6.2.3 Splitting the Dataset
6.2.4 Performing Modeling
6.2.5 Effect After Data Preprocessing
7 Emotion Recognition of Customer Evaluations in the Retail Industry
7.1 Introduction
7.1.1 Objectives
7.1.2 Case background
7.2 Procedure
7.2.1 Data Management
7.2.2 Data Reading
7.2.3 Data Processing
7.2.4 Model Training
8 Boston Housing Price Forecast
8.1 Introduction
8.1.1 About This Lab
8.1.2 Objectives
8.1.3 Experiment Dataset and Framework
8.2 Procedure
8.2.1 Introducing the Dependency
8.2.2 Loading the Dataset, Viewing Data Attributes, and Visualizing Data
8.2.3 Splitting and Preprocessing the Dataset
8.2.4 Performing Modeling on the Dataset by Using Various Regression Models
8.2.5 Adjusting Grid Search Hyperparameters
9 E-commerce Website User Group Analysis
9.1 Introduction
9.1.1 About This Lab
9.2 Experiment Code
9.2.1 Using sklearn for Modeling
9.2.2 Selecting the Optimal k Value
1 Feature Engineering on Banks' Private Credit Data
1.1 Introduction
1.1.1 About This Lab
Feature engineering is the process of extracting features from raw data. Data and features
determine the upper limit of machine learning, while models and algorithms only help
approach this upper limit. Feature engineering and construction aim to make the extracted
features represent the essential characteristics of the data as fully as possible, so that a
model built on these features predicts well on unseen data.
1.1.2 Objectives
Upon completion of this task, you will be able to:
• Master the Python-based feature selection method.
• Master the Python-based feature extraction method.
• Master the Python-based feature construction method.
1.1.3 Case Background

With the development of online financial services, bank H plans to evaluate customer risks
through online approval to reduce labor costs and improve approval efficiency. Online
approval requires a stricter and more accurate risk control model to control corporate
financial risks. Therefore, before constructing the credit risk model from historical
customer credit data, algorithm engineer A needs to complete feature engineering and
construction. Engineer A needs to complete the following operations:
• Data preprocessing
• Feature selection
• Feature construction
1.2 Data Preprocessing

1.2.1 Background
The raw data collected by the back-end server of the bank may have problems such as
missing values, garbled characters, redundant fields, and inconsistent data formats. To
improve the data quality, engineer A needs to cleanse the data first.
1.2.2 Procedure
1.2.2.1 Importing Data
Code:
import pandas as pd
Output:
1.2.2.2 Processing Missing Values

Step 1 View the missing values.
The missing values in the data may be caused by machine faults, manual input errors, or
the nature of the business field itself. The method for processing the missing values varies
with the cause.
missingno is a tool for visualizing missing values. You can run the following command to
view the missing-value distribution in the data:
Code:
import missingno # Import the missingno package.
Output:
As shown in the figure above, the Nation, Marriage_State, Highest Education,
House_State, Industry, Title, and Duty fields contain a large number of missing values. In
Pandas, isnull() flags each missing value, isnull().sum() counts the missing values in each
field, and dividing that count by the number of rows gives the missing rate of each field.
Code:
Output:
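As a minimal sketch of the missing-rate check described above, the following uses a small made-up DataFrame. The field names are taken from this lab, but the values and the variable names (data, missing_count, missing_rate) are illustrative assumptions, not the bank's data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from this lab.
data = pd.DataFrame({
    "Nation": [1.0, np.nan, 1.0, np.nan],
    "Marriage_State": [2.0, 1.0, np.nan, 1.0],
    "Age": [35, 42, 28, 51],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_count = data.isnull().sum()

# Dividing by the number of rows gives the missing rate of each field.
missing_rate = missing_count / len(data)
print(missing_rate.sort_values(ascending=False))
```

Sorting the rates in descending order puts the fields that need the most attention at the top.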
Step 2 Fill the missing values with the mode.
Pandas provides fillna() to fill missing values and mode() to compute the mode (the most
frequent value) of a field. Because several fields contain missing values, construct a for
loop that fills the missing values of each field with that field's mode.
# Define the list of fields with missing values.
# Use the for loop to process the missing values in the multiple fields.
After the processing is complete, check the missing rate of each field.
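The loop described in this step could look roughly like the following sketch. The toy values and the names data and missing_cols are assumptions made for illustration; in the lab, the list would contain the fields identified in Step 1:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from this lab.
data = pd.DataFrame({
    "Nation": [1.0, np.nan, 1.0, 2.0],
    "Title": [3.0, 3.0, np.nan, 1.0],
})

# Define the list of fields with missing values.
missing_cols = ["Nation", "Title"]

# Use the for loop to process the missing values in the multiple fields.
for col in missing_cols:
    # mode() returns a Series (ties are possible), so take its first entry.
    data[col] = data[col].fillna(data[col].mode()[0])

# Check the missing rate of each field after processing.
print(data.isnull().sum() / len(data))
```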
The following are common methods for handling missing values:
1. Direct deletion: deletes non-critical business fields whose missing rate is greater
than 80%.
2. Data filling: fills the missing values with a fixed value, a statistical indicator (such
as the mean, median, or mode), or a value predicted by an algorithm.
3. Separate processing: treats samples with missing values as a category of their own.
----End
1.3 Feature Selection

1.3.1 Background
If too many features remain after data preprocessing, the model may be unstable and
generalize poorly, and the computing cost rises sharply as the number of features grows.
Therefore, engineer A needs to preliminarily filter out features that are not important to
the prediction result.
1.3.2 Feature Selection Methods

The following are methods for feature selection:
• Filter: filters features based on statistical indicators of the features themselves and
of the relationship between each feature and the target variable.
• Wrapper: trains models on different feature subsets and uses the model precision as
the evaluation indicator for each subset.
• Embedded: evaluates feature weights during model training and scores the
importance of the features.
1.3.3 Filter
Step 1 Analyze the crosstab.
Use the crosstab() method to build a crosstab, taking the variable House_State and the
target variable Target as an example.
In the output, the default rate is 0.019 when House_State is 1 and 0.045 when
House_State is 2. If the default rates were the same for every value, House_State would
have no effect on default prediction. Here they differ, which suggests that the variable may
carry useful information.
The crosstab analysis can only be used for preliminary judgment. The chi-square test is
further needed to determine whether the difference is statistically significant.
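A crosstab of this kind can be sketched as follows. The data is made up, and the sketch assumes that Target = 1 marks a default; only the field names House_State and Target come from this lab:

```python
import pandas as pd

# Hypothetical toy data; Target = 1 is assumed to mean "default".
data = pd.DataFrame({
    "House_State": [1, 1, 1, 1, 2, 2, 2, 2],
    "Target":      [0, 0, 0, 1, 0, 0, 1, 1],
})

# Counts of each (House_State, Target) combination, with row and column totals.
counts = pd.crosstab(data["House_State"], data["Target"], margins=True)

# normalize="index" turns each row into proportions,
# so the column for Target = 1 holds the default rate per House_State value.
rates = pd.crosstab(data["House_State"], data["Target"], normalize="index")

print(counts)
print(rates)
```

Comparing the Target = 1 column of rates across rows is the comparison of default rates that this step describes.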
Step 2 Perform the chi-square test.
Separate independent variables and dependent variables from the raw data, and select
categorical variables from the independent variables.
The Target field is the target variable and is assigned to y. The remaining columns, with
the target variable removed, are assigned to X as the independent variables. X_category
holds the categorical variables selected from X.
Import chi2 from sklearn.feature_selection and use chi2() to calculate the chi-square
statistic between each categorical variable and the target variable. chi2() returns the
chi-square statistics together with the corresponding p-values.
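The following sketch illustrates this step on made-up data. The values and the choice of category_cols are assumptions for illustration; which fields are categorical in the real dataset has to be decided from the data itself. Note that chi2() expects non-negative feature values:

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical toy data; only the column names come from this lab.
data = pd.DataFrame({
    "House_State": [1, 1, 1, 1, 2, 2, 2, 2],
    "Gender":      [1, 2, 1, 2, 1, 2, 1, 2],
    "Age":         [25, 37, 41, 29, 52, 33, 46, 38],
    "Target":      [0, 0, 0, 0, 1, 1, 1, 0],
})

# Separate the dependent variable (y) from the independent variables (X).
y = data["Target"]
X = data.drop(columns=["Target"])

# Assumed list of categorical fields for this toy example.
category_cols = ["House_State", "Gender"]
X_category = X[category_cols]

# chi2() returns one chi-square statistic and one p-value per feature.
chi2_values, p_values = chi2(X_category, y)
result = pd.DataFrame(
    {"chi2": chi2_values, "p_value": p_values}, index=category_cols
)
print(result.sort_values("chi2", ascending=False))
```

A larger chi-square statistic (and a smaller p-value) indicates a stronger association between the variable and the target.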
Step 3 Test the continuous variable correlation.
If two continuous independent variables are highly correlated, delete one of the two
independent variables or extract common information from the two independent variables.
The method parameter of the Pandas corr() function indicates the method for calculating
the correlation coefficient. The options are as follows:
• pearson: Pearson correlation coefficient, which measures linear correlation.
• kendall: Kendall rank correlation coefficient (tau), which is used for ordinal variables.
• spearman: Spearman rank correlation coefficient, which is mainly used for monotonic
but non-linear relationships and non-normally distributed data.
import matplotlib.pyplot as plt

import seaborn as sns
Calculate the correlation coefficient between continuous independent variables and select
the combination of independent variables whose correlation coefficient is greater than 0.8.
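A minimal sketch of this selection is shown below. The data is synthetic: Std_Cred_Limit is deliberately generated from Year_Income so that one highly correlated pair exists, which is an assumption of the example and not a property of the bank's data. The sketch compares the absolute value of the coefficient with 0.8 so that strong negative correlations would be caught as well:

```python
import numpy as np
import pandas as pd

# Synthetic continuous variables; only the column names come from this lab.
rng = np.random.default_rng(0)
income = rng.normal(50, 10, 200)
X_cont = pd.DataFrame({
    "Year_Income": income,
    "Std_Cred_Limit": income * 2 + rng.normal(0, 1, 200),  # nearly linear in income
    "Age": rng.normal(40, 8, 200),                         # generated independently
})

# Pairwise Pearson correlation matrix.
corr = X_cont.corr(method="pearson")

# Visit each pair once (upper triangle) and keep those with |r| > 0.8.
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]
print(high_pairs)
```

The same corr matrix can be passed to a heatmap function for a visual check.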
----End
1.3.4 Wrapper
In the wrapper selection method, a base model is trained for multiple rounds on different
feature subsets, and the model precision is used as the evaluation indicator for each
subset. After each round of training, the features with the smallest weight coefficients are
removed, and the next round of training is performed on the new feature set. The RFE()
method of the feature_selection submodule in sklearn is invoked for this purpose. The
logistic regression model LogisticRegression() is used as the base model and is passed to
RFE() as a parameter.
Wrapper:
estimator: basic training model, which is a logistic regression model in this example.
n_features_to_select: indicates the number of retained features.
fit(X,y): invokes and trains a model.
from sklearn.feature_selection import RFE

from sklearn.linear_model import LogisticRegression
Output:
20
[ True True False True True True False True True True True False
False True False True True False True True True True True True
True False False False True False]
[ 1 1 9 1 1 1 10 1 1 1 1 6 3 1 11 1 1 8 1 1 1 1 1 1
1 5 4 7 1 2]
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l2', random_state=None, solver='warn',
tol=0.0001, verbose=0, warm_start=False)
After fitting, the RFE object exposes the following attributes, which are printed in the
output above:
• n_features_: number of selected features, that is, the value of the
n_features_to_select parameter passed to RFE().
• support_: Boolean mask of the selected features. True at a position indicates that
the corresponding feature is retained, and False indicates that it is removed.
• ranking_: feature ranking. ranking_[i] is the ranking of the ith feature. The value 1
marks a selected feature, so the 20 features ranked 1 are the optimal features.
• estimator_: the fitted base model, printed here with its parameters.
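The RFE workflow and the four attributes above can be sketched as follows. Because the credit dataset is not available here, the sketch uses synthetic data from make_classification with 10 features and keeps 5 of them. These numbers, and max_iter=1000, are assumptions chosen for the example rather than the settings used in the lab:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the credit data: 200 samples, 10 features.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Base model whose coefficients are used to rank the features.
estimator = LogisticRegression(max_iter=1000)

# Remove one feature per round (the default step) until 5 remain.
selector = RFE(estimator=estimator, n_features_to_select=5)
selector.fit(X, y)

print(selector.n_features_)  # number of retained features
print(selector.support_)     # True where a feature is retained
print(selector.ranking_)     # 1 for retained features, larger for earlier removals
print(selector.estimator_)   # base model refitted on the retained features
```

Features eliminated in earlier rounds receive larger ranking_ values, which is why the lab output contains rankings from 1 up to 11.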
1.3.5 Embedded
The embedded method trains a machine learning model to obtain the weight coefficient of
each feature, and selects features in descending order of the weight coefficients.
Common embedded methods are based on either of the following:
• Linear model and regularization
• Feature selection of a tree model
In a tree model, the importance of a feature reflects how much the splits on that feature
improve the model. Features used near the root of a tree generally affect more samples.
In this experiment, the random forest is used to calculate the importance of each feature.
The RandomForestClassifier class in the sklearn.ensemble submodule is invoked, and the
model is trained by using the fit(X,y) method.
from sklearn.ensemble import RandomForestClassifier
After the model training is complete, the importance score of each feature is printed.
Output:
[(0.1315, 'Ast_Curr_Bal'),
(0.1286, 'Age'),
(0.0862, 'Year_Income'),
(0.0649, 'Std_Cred_Limit'),
(0.043, 'ZX_Max_Account_Number'),
(0.0427, 'Highest Education'),
(0.0416, 'ZX_Link_Max_Overdue_Amount'),
(0.0374, 'ZX_Max_Link_Banks'),
(0.0355, 'Industry'),
(0.0354, 'ZX_Max_Overdue_Duration'),
(0.0311, 'ZX_Total_Overdu_Months'),
(0.0305, 'Marriage_State'),
(0.0305, 'Duty'),
(0.0292, 'Couple_Year_Income'),
(0.0279, 'ZX_Credit_Max_Overdu_Amount'),
(0.0246, 'ZX_Max_Overdue_Account'),
(0.0241, 'ZX_Max_Credit_Banks'),
(0.0221, 'ZX_Max_Credits'),
(0.0205, 'Birth_Place'),
(0.0195, 'Loan_Curr_Bal'),
(0.0173, 'L12_Month_Pay_Amount'),
(0.015, 'ZX_Credit_Max_Overdue_Duration'),
(0.013, 'Title'),
(0.0097, 'ZX_Credit_Total_Overdue_Months'),
(0.0096, 'Nation'),
(0.0084, 'Gender'),
(0.0079, 'Work_Years'),
(0.0064, 'ZX_Max_Overdue_Credits'),
(0.0059, 'House_State'),
(0.0, 'Couple_L12_Month_Pay_Amount')]
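A list of (score, feature name) pairs like the one above can be produced along the following lines. This is a sketch on synthetic data with placeholder column names (feature_0 to feature_5), and the hyperparameters are assumptions, not the settings used to produce the output above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the credit data, with placeholder column names.
X_arr, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
names = [f"feature_{i}" for i in range(6)]
X = pd.DataFrame(X_arr, columns=names)

# Train the random forest; feature_importances_ is available after fit().
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Pair each rounded importance with its feature name, highest score first.
importances = sorted(
    zip((round(float(v), 4) for v in model.feature_importances_), X.columns),
    reverse=True,
)
print(importances)
```

The importance scores of a fitted forest sum to 1, so each value can be read as the share of the total importance attributed to that feature.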
1.3.6 Variable Removal

Based on the results of the preceding three feature selection methods (filter, wrapper, and
embedded), the variables that contribute little to the model are removed.
1.4 Feature Construction

1.4.1 Background
Feature selection is used to find the importance of each feature to model construction and
remove variables that have little impact on model construction to reduce dimensions.
Feature construction is to manually discover some significant model features from raw
data. On the one hand, a new variable can be constructed by combining several different
features based on service understanding; on the other hand, variables of different time
windows can be divided according to a time attribute.
Engineer A has completed the preliminary filtering of features, removed some variables
that have little impact on the model, and now attempts to construct some new features to
improve the model precision.
1.4.2 Polynomial Feature Construction

Polynomial feature construction explores the impact of combined variables on the target
variable by multiplying existing features together. The PolynomialFeatures() class in the
sklearn.preprocessing submodule is used for feature interaction. The variables with the
highest importance scores in the model, that is, Ast_Curr_Bal, Age, Year_Income, and
Std_Cred_Limit, are selected to construct polynomial features.
PolynomialFeatures(degree=3): generates every product of the selected variables whose
total degree is at most 3 (for example, Age, Age^2, Age^3, and Age^2 Std_Cred_Limit),
plus a constant column named 1.
from sklearn.preprocessing import PolynomialFeatures

# Select fields used to construct the polynomial feature.
# Invoke the fit() method to construct the polynomial feature.
# Convert data.
To check the correlation between the newly generated variable and the target variable,
construct a dataset containing the target variable and the newly generated variable first.
The corr() function is used to calculate the correlation coefficient between the newly
generated variable and the target variable.
Output:
Five features with the smallest correlation coefficients:

Age^3 -0.010601
Age^2 -0.009275
Age^2 Std_Cred_Limit -0.008064
Age -0.007356
Age Std_Cred_Limit -0.006834
Name: Target, dtype: float64
Five features with the largest correlation coefficients:
Year_Income^3 -0.001910
Ast_Curr_Bal Age -0.001114
Ast_Curr_Bal 0.002849
Target 1.000000
1 NaN
Name: Target, dtype: float64
2 Real-Time Recommendation Practice for Retail Products
2.1 Introduction
Mr. Zhao works in the AI algorithm department of e-commerce platform company A and
is responsible for product recommendation for online businesses. In the modern world of
the Internet and e-commerce, people are overwhelmed by data. The data contains useful
information, but it is difficult for users to extract the information they are interested in.
To help users find product information, a recommendation system identifies similarities
between users and products and provides suggestions for customers based on these
similarities. The recommendation system is beneficial in:
• Helping users find the right products.
• Increasing user engagement by providing recommendations. For example, Google News
saw a 40% increase in hits due to recommendations.
• Helping item providers deliver items to the right users. At Amazon, 35% of
products are sold through recommendations.
• Helping personalize the recommended content. On Netflix, most rented movies are
recommended ones.
2.2 Procedure
2.2.1 Preparing E-commerce Platform Data
Step 1 Import the required packages.
NumPy provides functions for basic operations on arrays. Pandas provides many methods
for data processing and time series operations.
# Import module packages required by the project.

import numpy as np
import pandas as pd
Step 2 Read data.

Step 3 Check the data overview.
View the format of the read data. You can use the head() function to check the first five
rows of the data to get a rough understanding of the data content.
Step 4 View the data size.
You can further view the data size (the number of samples and the number of features in
the data) by using the shape attribute.
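Steps 3 and 4 can be tried on a small made-up rating table such as the one below. The column names userId, productId, Rating, and timestamp are the ones used in this lab, while the rows and the variable name ratings are illustrative assumptions:

```python
import pandas as pd

# Hypothetical ratings; only the column names come from this lab.
ratings = pd.DataFrame({
    "userId": ["U1", "U2", "U1", "U3", "U2", "U4"],
    "productId": ["P1", "P1", "P2", "P2", "P3", "P1"],
    "Rating": [5.0, 4.0, 3.0, 5.0, 4.0, 2.0],
    "timestamp": [1365811200, 1341100800, 1367193600,
                  1374451200, 1334707200, 1397433600],
})

# head() returns the first five rows by default.
print(ratings.head())

# shape is an attribute (no parentheses): (number of rows, number of columns).
n_rows, n_cols = ratings.shape
print(n_rows, n_cols)
```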
Step 5 View the data type.
After learning of the data size, you still need to view the data types by using the dtypes
attribute to facilitate subsequent data calculation.
According to the result, only Rating and timestamp are of a numeric type and can be
used for mathematical calculation. If userId and productId need to be used for
mathematical calculation, convert their types first. In addition, you can use the info()
function to view the general information about the data.
The result contains the number of data samples, column names, data types, and data
storage size. The info() function displays all of this information by default, but you can
set the corresponding parameter to False to hide an item. For example, setting
memory_usage to False hides the data storage size.
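The following sketch shows the two inspection calls on a made-up rating table (the rows are assumptions, and only the column names come from this lab):

```python
import pandas as pd

# Hypothetical ratings; only the column names come from this lab.
ratings = pd.DataFrame({
    "userId": ["U1", "U2", "U1", "U3"],
    "productId": ["P1", "P1", "P2", "P2"],
    "Rating": [5.0, 4.0, 3.0, 5.0],
    "timestamp": [1365811200, 1341100800, 1367193600, 1374451200],
})

# dtypes is an attribute (no parentheses): one data type per column.
print(ratings.dtypes)

# info() prints the row count, column names, non-null counts, dtypes, and memory usage.
ratings.info()

# memory_usage=False hides the data storage size.
ratings.info(memory_usage=False)
```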
Step 6 View the product ratings from users.
Product ratings are important data that can reflect users' preference. The data is critical to
an efficient recommendation system. You can use the describe function to check the data
overview of the numeric type. To view only the preliminary data analysis of Rating, add
the corresponding column name in square brackets to the end of the command.
The result contains the mean, standard deviation, minimum, maximum, and quartiles of the
data, and the product rating is generally about 4. You can use the min() and max()
functions to print the minimum and maximum values of the rating.
You can also use the print() function to print the result or the value of a parameter.
According to the result, the highest rating is 5, indicating that users' ratings on the product
are generally high.
Step 7 View the missing values of the data.
The most important factors that affect data quality are missing values and abnormal values.
As the ratings all fall within the normal range, you only need to check for missing values:
use the isnull() function to mark whether each value is null, and then use the sum()
function to count the number of null values in each column.
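A minimal sketch of this check, assuming a small invented table with two gaps. Note that isnull().sum() counts the missing values in each column.

```python
import numpy as np
import pandas as pd

# Toy ratings with two missing entries (values are invented).
df = pd.DataFrame({
    "userId": ["u1", "u2", None, "u4"],
    "Rating": [5.0, np.nan, 3.0, 4.0],
})

# isnull() marks each missing cell as True; sum() adds the Trues per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```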
Step 8 Check whether the users and products are unique.
A user can rate multiple products. Similarly, a product can be rated by different users. To
determine the product types and the number of users, you need to check whether the users
and products are unique.
Step 9 Delete time information.
You can use the drop() function to delete the product rating time.
axis: drops columns when it is set to 1, and drops rows (by index label) when it is set to 0.
inplace: when set to True, modifies the original DataFrame directly instead of returning a
new one.
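As an illustration of the two parameters, the following sketch drops a timestamp column from an invented two-row table. The column name is taken from the text; the data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "userId": ["u1", "u2"],
    "Rating": [5.0, 4.0],
    "timestamp": [1365811200, 1341100800],
})

# axis=1 targets a column label; inplace=True modifies df itself
# instead of returning a new DataFrame.
df.drop("timestamp", axis=1, inplace=True)
print(df.columns.tolist())
```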
Step 10 Analyze the rating data.
Sort the users and products by rating and view the sorting result.
groupby(): groups rows that share the same value in the specified column.
sort_values(): sorts the data by value.
ascending: whether to sort in ascending order (True) or descending order (False).
After obtaining the product data corresponding to the sorted user ratings, use the
quantile() function to compute the quantiles, and display them in a chart.
# quantiles with 0.05 difference
# quantiles with 0.25 difference
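The original code for this step is not shown. One plausible reading, sketched below on invented data, counts the ratings given by each user, sorts the counts in descending order, and computes quantiles in steps of 0.25. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "userId": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "productId": ["p1", "p2", "p3", "p1", "p2", "p1"],
    "Rating": [5.0, 4.0, 3.0, 4.0, 2.0, 5.0],
})

# Count how many products each user rated, most active users first.
ratings_per_user = (
    df.groupby("userId")["Rating"].count().sort_values(ascending=False)
)

# Quantiles of the per-user counts at 0, 0.25, 0.5, 0.75, and 1.
quantiles = ratings_per_user.quantile(np.arange(0, 1.01, 0.25))
print(ratings_per_user)
print(quantiles)
```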

----End
2.2.2 Recommending Products Based on the Product Popularity

Sorting the products by rating to analyze the product popularity during data preparation
helps implement recommendation based on the product popularity.
Step 1 Sort products.
Similar to user sorting, products can be sorted based on the rating data to obtain the
products that have been rated more than 50 times.
Step 2 Sort the products by the average rating.
Calculate the average rating of each product, and then sort the products based on the
average rating.
# Calculate the average rating of each product.
# Obtain the rankings of the products sorted by the number of rating times.
The result shows that the product with the highest average rating is rated by 1051 users.
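The two steps above can be sketched as follows. The data is invented, and the threshold is lowered from 50 to 1 (a hypothetical min_ratings variable) so that the filter has a visible effect on such a small table.

```python
import pandas as pd

df = pd.DataFrame({
    "productId": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "Rating": [5.0, 4.0, 3.0, 5.0, 5.0, 2.0],
})

min_ratings = 1  # the lab uses 50; lowered for this tiny table

# Keep only products that were rated more than min_ratings times.
counts = df.groupby("productId")["Rating"].transform("count")
popular = df[counts > min_ratings]

# Average rating and number of ratings per product, best average first.
summary = popular.groupby("productId")["Rating"].agg(["mean", "count"])
summary = summary.sort_values("mean", ascending=False)
print(summary)
```

Keeping the count next to the mean makes it easy to see whether a high average is backed by many ratings or only a few.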
Step 3 Visualize the result.
Analyze the product rankings and display the result in a chart. Specifically, use a histogram
first to display the distribution of the number of users who rate each type of products.
hist(): histogram
bins: number of buckets in the histogram.
Sort the products by the number of users who rate the products, to obtain the product
popularity.
----End
2.2.3 Recommending Products Based on Collaborative Filtering

Recommending products based on the product popularity is not enough to meet the actual
requirement. Therefore, the currently popular collaborative filtering method is used to
implement recommendation.
Step 1 Create a table of relationships between products and users.
Select 10,000 samples and use pivot_table() to create a table of relationships between
products and users.
You can use the shape attribute to view the table size, and then transpose the table so
that each row corresponds to a product. The data in the table is the product ratings from users.
# View the data size.
Step 2 Decompose the table.
You can use the SVD algorithm to reduce the dimensions of the table to obtain 10
important product-based features.
from sklearn.decomposition import TruncatedSVD # Import the SVD algorithm.

# Construct an SVD model to combine the features (that is, the columns) into
# 10 important combined features.
Step 3 Build a correlation coefficient matrix.
Calculate product similarities to implement a product-based recommendation system.

corrcoef(): calculates the correlation coefficient between data.
Step 4 Recommend products based on the product similarities.

Randomly select a product, select products whose coefficient of correlation with the
selected one is greater than 0.65, and recommend these products to users who like the
selected one.
# Determine whether the product is unique.
# View the number of similar products in the similarity table.
# Select products whose coefficient of correlation with the 20th product is greater than 0.65.
# Delete the 20th product.
# Recommend products ranked ahead to the users who like the 20th product.
As shown in the result, there are eight products whose coefficient of correlation with the
20th product (9984984354) is greater than 0.65. You can also select other products to view
their similar products.
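The four steps of this section can be sketched end to end as follows. This is not the lab's original code: the ratings are randomly generated, only 3 components are kept instead of 10 because the toy table is much narrower, and names such as user_item and item_features are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Randomly generated ratings (invented data, fixed seed).
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "userId": rng.integers(0, 30, size=200),
    "productId": rng.integers(0, 8, size=200),
    "Rating": rng.integers(1, 6, size=200),
})

# Rows are users, columns are products; unrated pairs become 0.
user_item = ratings.pivot_table(
    values="Rating", index="userId", columns="productId", fill_value=0
)
item_user = user_item.T  # transpose: one row per product

# Compress each product's rating vector into 3 combined features.
svd = TruncatedSVD(n_components=3, random_state=0)
item_features = svd.fit_transform(item_user.to_numpy())

# Correlation coefficient between every pair of products.
corr = np.corrcoef(item_features)

# Products whose correlation with the chosen product exceeds 0.65,
# excluding the chosen product itself.
product_ids = list(item_user.index)
chosen = product_ids[0]
similar = [p for p, s in zip(product_ids, corr[0])
           if s > 0.65 and p != chosen]
print(similar)
```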
----End
3 Private Credit Default Prediction
3.1 Introduction
Under the impact of the Internet, financial institutions are suffering from internal and
external troubles. On one hand, financial institutions encounter great competition and
performance pressure from large financial and technology enterprises; on the other hand,
more and more criminal groups use artificial intelligence (AI) technologies to increase the
crime efficiency. These risk details are hidden in each transaction phase. If they are not
prevented, losses will be irreparable. Therefore, financial institutions pose increasingly high
requirements on risk management accuracy and approval efficiency.
This experiment works through the problem step by step, covering problem statement,
breakdown, priority ranking, solution design, key point analysis, and summary and
suggestions. It is intended to build project implementation thinking and to carry out
private credit default prediction from scratch.
3.1.1 Objectives
 Understand the significance of credit default prediction.
 Master the development process of big data mining projects.
 Master the common algorithms for private credit default prediction.
 Understand the importance of data processing and feature engineering.
 Master the common methods for data preprocessing and feature engineering.
 Master the algorithm principles of logistic regression and XGBoost, and understand
the key parameters.
3.1.2 Background
The case in this document is for reference only. The actual procedure may vary. For details,
see the corresponding product documents.
The company has just set up a project team for private credit default prediction. Engineer
A was appointed as the offline development PM of the project. This project aims to:
 Identify high-risk customers efficiently and accurately using new technologies.
 Make risk modes data-based by using scientific methods.
 Provide objective risk measurement.
 Reduce subjective judgments.
 Improve risk management efficiency.
 Save labor costs.

The ultimate goal is to productize the results, so that front-end operating departments can
identify transactions with credit default risks in a timely manner to avoid corporate losses.
3.2 Procedure
3.2.1 Reading Data
First, import the dataset. This document uses pandas, a third-party library, to import
the dataset.
import pandas as pd
# Use pd.read_csv to read the dataset. (The dataset is stored in the current directory, so it
# can be read with a relative path.)
# ./credit.csv refers to a file in the current directory. Use the forward slash (/) in the
# path, as in Linux paths.
# Windows paths normally use the backslash (\), but in Python strings the backslash starts an
# escape sequence, so write the path with forward slashes on Windows as well.
# The forward slash is on the same key as the question mark (?).
# An auxiliary module warnings can be imported.
# This module can help filter many redundant and annoying warnings.
# After data reading, some simple operations can be performed, for example:
# Run the following command to view all data.
data
# Run the following command to view the first 10 rows of data.
# Run the following command to view the number of rows and columns of the data.
data.shape
3.2.2 Viewing Missing Values

# Check the data missing status in a visualized manner.
# The third-party library missingno is used.
# Many values are missing and need to be filled.

# There are many filling methods. The missing values can be filled with the mean, the median,
# or the mode.
# Numeric features can be discrete or continuous.
# If the mean is used, a value outside the set of discrete values may be generated.
# Therefore, the mode is used.
# isnull() is used to determine whether a value is null. If yes, True is returned. If not,
# False is returned.
# In Python, True is equal to 1, and False is equal to 0.
# Therefore, sum() counts the null values in a column. If the result is greater than 0, the
# column has missing values.
# The features with missing values are placed in the missname list.
# fillna() is used to fill empty values with the mode.
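The comments above describe the filling logic without the code itself. A minimal sketch under stated assumptions: the three columns and their values are invented, and only the list name missname comes from the text.

```python
import numpy as np
import pandas as pd

# Invented table with gaps in two columns.
data = pd.DataFrame({
    "age": [25, 40, np.nan, 40, 31],
    "housing": [1, np.nan, 1, 0, 1],
    "income": [3000, 5200, 4100, 3900, 4500],
})

# Columns that contain at least one missing value.
missname = [col for col in data.columns if data[col].isnull().sum() > 0]

# Fill each such column with its most frequent value (the mode).
for col in missname:
    data[col] = data[col].fillna(data[col].mode()[0])

print(missname)
print(data.isnull().sum())
```

mode() returns a Series because several values can tie for most frequent, which is why the first entry is selected with [0].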
3.2.3 Splitting the Dataset

Then, split the dataset. Before splitting the dataset, remove the index as it is unimportant
and interferes with the model judgment, and remove Target (result) which cannot be used
as input into the model.
X is equivalent to an independent variable in mathematics, and y is equivalent to a
dependent variable. Import the dataset splitting function to split the dataset.
from sklearn.model_selection import train_test_split
X_train is the training set, and y_train contains the labels of the training set. X_test is
the test set, and y_test contains the labels of the test set. test_size=0.1 indicates that
the ratio of the training set to the test set is 9:1. shuffle indicates whether the data is
shuffled before it is split.
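A minimal sketch of such a split on invented arrays. The random_state argument is an addition not mentioned in the text; it only makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 invented samples with 2 features and a binary label.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=0
)
print(X_train.shape, X_test.shape)
```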
3.2.4 Standardizing Data (Preprocessing Data)

After the dataset is split, standardize the data.
from sklearn.preprocessing import StandardScaler
The standardization class StandardScaler() is first instantiated. The fit function is then
used to obtain the mean and standard deviation of the dataset. Finally, transform is used
to transform the data.
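The fit and transform calls can be sketched as follows on invented numbers. Fitting on the training set only and reusing its statistics for the test set is a common convention and is assumed here; the guide does not state which data the scaler is fitted on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaler.fit(X_train)                     # learn mean and standard deviation
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # reuse the training statistics
print(X_train_std)
print(X_test_std)
```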
3.2.5 Handling the Class Imbalance Issue (Preprocessing)

Next, handle the class imbalance issue, which is mainly about the difference between the
number of positive results and the number of negative results of the data. In this dataset,
the number of defaulters (represented by 1) is very small, and the number of non-
defaulters (represented by 0) is very large. Therefore, the model tends to determine people
as non-defaulters due to class imbalance. Check the current result ratio first.
from collections import Counter
# Use collections in the standard library to query the results.

# Import the third-party library imblearn.
from imblearn import over_sampling
# X_train_fix and y_train_fix are the corrected data.

# Next, check the number of samples.
# Check the corrected y_train_fix instead of the original y_train.
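The guide imports over_sampling from imblearn but does not show which sampler is used. To avoid guessing, the sketch below substitutes plain random oversampling with sklearn.utils.resample, which is a different and simpler technique than imblearn's samplers. The data is invented; the names X_train_fix and y_train_fix follow the comments above.

```python
from collections import Counter

import numpy as np
from sklearn.utils import resample

# Invented imbalanced training data: 90 non-defaulters, 10 defaulters.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = np.array([0] * 90 + [1] * 10)
print(Counter(y_train))

# Resample the minority class with replacement up to the majority size.
X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_train_fix = np.vstack([X_maj, X_min_up])
y_train_fix = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(Counter(y_train_fix))
```

Only the training set is resampled; the test set keeps its original class ratio so that evaluation reflects real-world data.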
3.2.6 Performing Grid Search (Modeling)

from sklearn.model_selection import cross_val_score
# Declare the logistic regression algorithm and set max_iter (the maximum number of solver
# iterations) to 500.
# Evaluate the model with cross-validation, which splits the dataset automatically.
# cv=5 indicates that the dataset is split into five equal folds.
# Apply grid search to find the optimal parameters through traversal.

# Import the grid search module.
from sklearn.model_selection import GridSearchCV
# C is the inverse of the regularization strength (a smaller C means stronger regularization).
# solvers indicates the optimization method.

# Combine the values of C with the optimization methods in a dictionary.
# Declare the logistic regression algorithm.
# Declare the grid search algorithm and specify the cross-validation method.
# Perform training.
# Check the optimal accuracy.
# Check which parameters are optimal.
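The comments above outline the procedure. A minimal sketch follows, assuming synthetic data from make_classification in place of the credit dataset; the candidate values for C and the solver are illustrative and are not taken from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Baseline: 5-fold cross-validated accuracy of a single model.
base = LogisticRegression(max_iter=500)
print(cross_val_score(base, X, y, cv=5).mean())

# Candidate values of C and solvers (illustrative values).
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}

# Try every combination with 5-fold cross-validation.
grid = GridSearchCV(LogisticRegression(max_iter=500), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_score_)   # best mean cross-validated accuracy
print(grid.best_params_)  # parameter combination that achieved it
```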
3.2.7 Verifying Performance (Evaluation)

Then, verify the result.
from sklearn.metrics import accuracy_score,precision_score,recall_score,roc_auc_score

# Use the obtained optimal parameters for modeling.
# Construct a function to return the verification result.
# Output the performance data of the training set.
# Output the performance data of the test set.
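A sketch of the evaluation step, assuming synthetic data and placeholder parameters. The values of C and solver stand in for whatever the grid search selected, and the helper name score is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# C and solver are placeholders for the grid search result.
model = LogisticRegression(C=1.0, solver="lbfgs", max_iter=500)
model.fit(X_train, y_train)


def score(model, X, y):
    """Return accuracy, precision, recall, and ROC AUC as a dict."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y, pred),
        "precision": precision_score(y, pred),
        "recall": recall_score(y, pred),
        "auc": roc_auc_score(y, proba),
    }


train_scores = score(model, X_train, y_train)
test_scores = score(model, X_test, y_test)
print(train_scores)
print(test_scores)
```

On imbalanced data such as credit defaults, recall and AUC are usually more informative than accuracy alone.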

3.2.8 Saving the Model

Save the model.
# Import the joblib library

import joblib
# The dump function is used to save models. Pass the trained model and the file name to the
# dump function to save the model.
# The model file is conventionally suffixed with .pkl.
# Load the saved model again.
# After the model is loaded, use the model for prediction directly.
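A minimal sketch of saving and reloading a model. The model is trained on synthetic data, and the file name model.pkl and the temporary directory are arbitrary choices made for this example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=500).fit(X, y)

# Save to a temporary directory; "model.pkl" is an arbitrary file name.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, path)

# Load the saved model and predict directly with it.
loaded = joblib.load(path)
same = bool((loaded.predict(X) == model.predict(X)).all())
print(same)
```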
4 Survival Prediction of the Titanic
4.1 Introduction
This experiment predicts whether passengers on the Titanic survived, based on the
Titanic datasets.
4.1.2 Objectives
 Use the Titanic datasets open to the Internet as the model input data.
 Build, train, and evaluate machine learning models.
 Understand the overall process of building a machine learning model.
4.1.3 Datasets and Frameworks

This experiment is based on train.csv and test.csv. train.csv contains the result about
whether the passengers survived. test.csv has no target, that is, no result, and can
be used as a real-world dataset. Involved parameters are as follows:
 PassengerId: passenger ID
 Pclass: cabin class (class 1/2/3)
 Name: passenger name
 Sex: gender
 Age: age
 SibSp: number of siblings and spouses aboard
 Parch: number of parents and children aboard
 Ticket: ticket No.
 Fare: ticket price
 Cabin: cabin No.
 Embarked: port of boarding
4.2 Procedure
4.2.1 Importing Related Libraries
import pandas as pd
import numpy as np
import random as rnd

%matplotlib inline

from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
4.2.2 Importing Datasets

Step 1 Read data.
Step 2 View data.
The first five rows of data are displayed.
The last five rows of data are displayed.

The data overview helps check whether some data is missing and what the data type is.
The related numeric-type information of the data helps check the average value and other
statistics.
The character-type information helps check the number of types, the type with the
maximum value, and the frequency.
Step 3 Check the survival probability corresponding to each feature based on statistics.
The intuitive data shows that passengers in class 1 cabins are more likely to survive.
The survival probability can be directly determined by the number of siblings.
When the survival probability is determined by gender, an obvious imbalance occurs.
As shown in the preceding figure, most young passengers died.
The following figure shows the survival probability determined based on the cabin and age.
----End
4.2.3 Preprocessing Data

As the datasets have missing values, combine the datasets, and fill the missing values with
data.
Step 1 Combine the datasets.
Step 2 Check for missing values.

Step 3 Fill the missing values with data.
Process the datasets by using different methods as required. For example, fill the Fare and
Embarked parameters having few missing values with the mode.
Fill the missing Age values with the average age.
Delete less significant data. Before this, assign a value to Target first.
Check whether missing values still exist.
Step 4 Convert data.
Convert some character-type data into numeric-type data for model input. To do so, check
the number of types first.
Use the search function to obtain each character-type value and replace it with a numeric-
type value.
test.csv cannot be used for training or testing as it does not contain Target. train.csv
contains 891 records (with Target), which need to be extracted.
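The preprocessing steps in this section can be sketched on two tiny invented tables. The feature column names follow the parameter list in 4.1.3; the target column is called Survived here, as in the public Titanic dataset, which is an assumption because this guide refers to it as Target. The numeric codes chosen for Sex and Embarked are arbitrary.

```python
import numpy as np
import pandas as pd

# Tiny invented stand-ins for train.csv (with target) and test.csv.
train = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.25],
    "Embarked": ["S", "C", None],
})
test = pd.DataFrame({
    "Sex": ["male", "female"],
    "Age": [34.0, np.nan],
    "Fare": [np.nan, 7.0],
    "Embarked": ["S", "S"],
})

# Combine so that both files are cleaned in the same way.
full = pd.concat([train, test], ignore_index=True)

# Few missing values: fill with the mode. Age: fill with the mean.
full["Fare"] = full["Fare"].fillna(full["Fare"].mode()[0])
full["Embarked"] = full["Embarked"].fillna(full["Embarked"].mode()[0])
full["Age"] = full["Age"].fillna(full["Age"].mean())

# Replace character values with numeric codes.
full["Sex"] = full["Sex"].map({"male": 0, "female": 1})
full["Embarked"] = full["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# The first len(train) rows came from train.csv and carry the target.
train_clean = full.iloc[:len(train)]
print(train_clean)
```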
----End
4.2.4 Building a Model

This section describes how to build a model. To build a model, split the training set and
test set.
Step 1 Split the dataset.
Step 2 Train a model.
The logistic regression algorithm, random forest algorithm, and AdaBoost are used for
training.

from sklearn import ensemble
As shown above, the random forest algorithm has a good effect.
Step 3 Predict data.

----End
5 Linear Regression
5.1 Introduction
This experiment uses the basic Python code and the simplest data to reproduce how a
linear regression algorithm iterates and fits the existing data distribution.
The NumPy and Matplotlib modules are used in the experiment. NumPy is used for
calculation, and Matplotlib is used for drawing.
5.1.2 Objectives
 Be familiar with basic Python statements.
 Master the procedure for implementing linear regression.
5.2 Procedure
5.2.1 Preparing Data
Randomly set ten pieces of data, with the data in a linear relationship.
Convert the data into an array format so that the data can be directly calculated when
multiplication and addition are used.
Code:
# Import the required modules NumPy for calculation and Matplotlib for drawing.
import numpy as np
#This code is used only for Jupyter Notebook.
%matplotlib inline
# Define data and convert the list into an array.
# Display the data through the scatter chart.

Output:
Figure 5-1 Scatter chart
5.2.2 Defining Related Functions

Model function: defines the linear regression model wx+b.
Loss function: calculates the mean square error.
Optimization function: calculates the partial derivatives of the loss with respect to w
and b, and updates w and b by using the gradient descent method.
Code:
# The basic linear regression model is wx+b. In this example, the model is ax+b as a
# two-dimensional space is used.
# The mean square error loss function is the most commonly used loss function in the linear
# regression model.
# The optimization function mainly uses the partial derivatives to update a and b.
# Perform function iteration to return a and b.
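The function bodies are not reproduced in this guide. The following sketch shows one way to write them, assuming ten invented points on the line y = 2x + 1, a learning rate of 0.01, and hypothetical function names. The loss is halved so that its gradient carries no factor of 2.

```python
import numpy as np

# Ten invented points that follow y = 2x + 1 exactly.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0


def model(a, b, x):
    """Linear model ax + b."""
    return a * x + b


def loss_function(a, b, x, y):
    """Half of the mean squared error between prediction and target."""
    return 0.5 * np.mean((model(a, b, x) - y) ** 2)


def optimize(a, b, x, y, lr):
    """One gradient descent step on a and b."""
    error = model(a, b, x) - y
    da = np.mean(error * x)  # partial derivative of the loss w.r.t. a
    db = np.mean(error)      # partial derivative of the loss w.r.t. b
    return a - lr * da, b - lr * db


def iterate(a, b, x, y, times, lr=0.01):
    """Repeat the update step and return the fitted a and b."""
    for _ in range(times):
        a, b = optimize(a, b, x, y, lr)
    return a, b


a, b = iterate(0.0, 0.0, x, y, times=10000)
print(a, b, loss_function(a, b, x, y))
```

Because the invented data lies exactly on a line, the loss can approach zero; with noisy data it would converge to a positive value instead.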

5.2.3 Starting Iteration

Step 1 Initialize the iterative optimization model.
Code:
# Initialize and display parameters.
Output:
Figure 5-2 First iteration
Step 2 Perform the second iteration and display the parameter values, loss values, and
visualization effect.
Code:
Output:
Figure 5-3 Second iteration
Step 3 Perform the third iteration and display the parameter values, loss values, and
visualization effect.
Code:
Output:
Figure 5-4 Third iteration
Step 4 Perform the fourth iteration and display the parameter values, loss values, and
visualization effect.
Code:
Output:
Figure 5-5 Fourth iteration
Step 5 Perform the fifth iteration and display the parameter values, loss values, and visualization
effect.
Code:
Output:
Figure 5-6 Fifth iteration
Step 6 Perform the 10000th iteration and display the parameter values, loss values, and
visualization effect.
Code:
Output:
Figure 5-7 10000th iteration
----End
5.3 Thinking and Practices

5.3.1 Question 1
Must the loss value return to zero when the raw data is modified?
5.3.2 Question 2
What is the role of the learning rate (Lr), and how does changing it affect the iteration?
6 Flower Category Analysis
6.1 Introduction
This experiment uses a dataset with a small sample quantity. The dataset includes the
open-source Iris data provided by scikit-learn. The Iris prediction project is a simple
classification model. By using this model, you can understand the basic usage and data
processing methods of the machine learning library sklearn.
6.2 Experiment Code

6.2.1 Importing Related Libraries
import numpy as np
6.2.2 Importing a Dataset

The dataset is the built-in data of sklearn. Therefore, no external dataset needs to be
imported.
from sklearn.datasets import load_iris
According to the preceding code, x is specified as a feature, and y as a label. The dataset
includes a total of 150 samples and four features: sepal length, sepal width, petal length,
and petal width.
6.2.3 Splitting the Dataset

Split the data into a training set and a test set.
View the data size after the splitting.

Data size before the splitting:
Data size after the splitting:
6.2.4 Performing Modeling

6.2.4.1 Logistic Regression
Import the algorithm model to be used.

from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
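Logistic regression is used for modeling first. For multiclass problems, scikit-learn's
LogisticRegression uses the one-vs-rest (OvR) or multinomial (softmax) method, depending
on the version and solver. It does not use the one-vs-one (OvO) method.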
As described above, logistic regression has a good effect.
6.2.4.2 SVM
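Use the Support Vector Machine (SVM) for classification. For multiclass problems,
scikit-learn's svm.SVC trains one-vs-one (OvO) classifiers internally, and its default
decision_function_shape='ovr' only gives the decision function a one-vs-the-rest (OvR)
shape. svm.LinearSVC, in contrast, uses the OvR method by default.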
6.2.4.3 Decision Tree

Use the decision tree algorithm.
6.2.4.4 K-Nearest Neighbors Algorithm

Use the k-nearest neighbors algorithm.
Three neighbors are set for the k-nearest neighbors algorithm. Another number of
neighbors can be tried for better accuracy.
Therefore, a loop over candidate values is used to find the optimal number of neighbors.
As shown in the figure above, the k-nearest neighbors algorithm has the optimal effect
when there is one nearest neighbor.
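The search for the number of neighbors can be sketched as a simple loop. The split ratio, the random_state, and the candidate range 1 to 10 are assumptions; the lab's own values are not shown, so the best value found here may differ from the one reported above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Try each candidate number of neighbors and record the test accuracy.
scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    scores[k] = knn.score(x_test, y_test)

# Smallest k among those with the highest accuracy.
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Selecting k on the test set is acceptable for a quick experiment, but cross-validation on the training set gives a less optimistic estimate.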
6.2.5 Effect After Data Preprocessing

Consider data standardization before modeling.
After standardization, the standard deviation is 1, and the mean value is infinitely close to
0.
Then, use the SVM to perform modeling after the standardization. Change the data names
of the training set and test set to new ones.
As described above, the SVM precision is also improved after the standardization.
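A sketch of standardization followed by SVM modeling on the Iris data. The split parameters and the new variable names x_train_std and x_test_std are assumptions, and svm.SVC with default settings stands in for the lab's unspecified SVM configuration.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(x_train)
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)

# Train on the standardized data and evaluate on the standardized test set.
clf = svm.SVC()
clf.fit(x_train_std, y_train)
acc = clf.score(x_test_std, y_test)
print(acc)
```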
7 Emotion Recognition of Customer Evaluations in the Retail Industry
7.1 Introduction
Emotion analysis is a classification technology based on natural language processing (NLP),
and is typically used to extract and classify the emotional content of texts.
Compared with recommendations and precision marketing, users prefer to read or
listen to the personal experience and feedback of users similar to themselves. For example,
evaluations from users who have purchased similar products and comparison results from
users who have used similar products bring value to both users and enterprises.
This experiment will discuss the problem and perform practice step by step from the
perspectives of problem statement, breakdown, priority ranking, solution design, key point
analysis, and summary and suggestions, and cultivate the project implementation thinking
and implement analysis of the evaluation emotion analysis project from scratch.
7.1.1 Objectives
 Clarify the function and business value of emotion analysis.
 Understand the differences between conventional machine learning and deep
learning in emotion analysis methods.
 Clarify label extraction methods for emotion analysis.
 Master deep learning-based emotion analysis methods.
 Understand future applications of emotion analysis.
7.1.2 Case Background

The case in this document is for reference only. The actual procedure may vary. For details,
see the corresponding product documents. Data engineer A works in the market data
analysis department of a Hi-Tech company. The company plans to develop home appliance
services, such as smart TV and smart reader, but it does not know how the current market
is and how users evaluate such products. Therefore, the company wants the data
department to output the market data survey report as soon as possible. Engineer A
considers using the NLP technology to analyze users' evaluation tendency and evaluation
keywords of competitors' products of the same type, and build an emotion prediction
model to predict the users' emotion tendency based on texts.
7.2 Procedure
7.2.1 Data Management
The following information is involved:
 Id: ID
 reviews.rating: score
 reviews.text: text evaluation
 reviews.title: evaluation keywords
 reviews.username: name of the evaluator
This dataset contains 21 attribute fields and 34,657 data samples. The experiment aims to
analyze customer evaluation data. Therefore, this document describes only the data
attributes required in this experiment.
7.2.2 Data Reading

After obtaining the provided data files, you need to read and view the data over Python by
performing the following steps:
Step 1 Import common library files such as sklearn, pandas, and numpy.
sklearn is a powerful third-party machine learning library of Python. It covers
various aspects from data preprocessing to model training. Most functions in the sklearn
library are classified into estimators and transformers. An estimator is equivalent to a
modeling tool, and is used to predict data. Common estimator functions include fit(x,y)
and predict(x). A transformer is used to process data, such as reducing dimensions and
standardizing data. Common transformer functions include transform(x) and
fit_transform(x,y).
import pandas as pd
import matplotlib as mpl
import nltk.classify.util
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk
%matplotlib inline
Step 2 Read data from a local disk.

Step 3 Visualize the data.
Visualize the first five rows of data and view the data attribute columns.
Output:
This experiment uses only the reviews.rating, reviews.text, reviews.username, and
reviews.title attribute columns. Therefore, you need to extract data from the dataset
again to retain only the attribute data in the four columns, and name the
extracted data permanent to facilitate the subsequent experiment procedure.
View the missing values.
Output:
The reviews.rating attribute column is indispensable to emotion analysis. The dataset
contains 34,657 data samples. The data volume is large. Therefore, you can set aside the
data samples with the reviews.rating value missing. Specifically, you can extract the data
with the reviews.rating value and name the data senti, and extract the data without the
reviews.rating value and name the data check.
With respect to score processing, this experiment defines data samples with the
reviews.rating value greater than or equal to 4 as positive (pos) and those with the
reviews.rating value less than 4 negative (neg), and renames the reviews.rating attribute
column senti.
replace(x,y): replaces x with y.
Visualize the data after identifying the samples as positive or negative.
Output:
The output shows that the data is unbalanced.
7.2.3 Data Processing

A regular expression is used to check whether a string matches a pattern. The re module
has been available since Python 1.5 and provides Python with full regular expression
support:
 re.sub(): replaces the matches of a pattern in a string.
re.sub(pattern, repl, string, count=0, flags=0)
 pattern: pattern string in the regular expression.
 repl: replacement string, which can also be a function.
 string: original character string to be searched and replaced.
 count: maximum number of replacements after pattern matching. The default value
is 0, indicating that all matches are replaced.
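The effect of the count parameter can be seen in a short sketch. The sample sentence and the pattern are invented for illustration.

```python
import re

text = "Great!!! tablet, 10/10."

# Replace every run of non-letter characters with a single space.
letters_only = re.sub(r"[^a-zA-Z]+", " ", text)

# count=1 replaces only the first match and leaves the rest untouched.
first_only = re.sub(r"[^a-zA-Z]+", " ", text, count=1)

print(repr(letters_only))
print(repr(first_only))
```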
Data sampling uses pandas.DataFrame.sample to randomly select several rows of data.
 n: number of rows to be extracted.
 frac: proportion of rows to be extracted. If frac is set to 0.8, 80% of the rows are
extracted.
 replace: indicates whether extraction is performed with replacement. The value True
indicates that extraction is performed with replacement.
 random_state: seed of the random number generator. If random_state is set to
None, a different sample may be drawn on each run.
Step 1 Import related packages.
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk
Step 2 Preprocess data.
Text data includes spaces, punctuation marks, and digits. This experiment focuses on the
text (English) analysis. Therefore, you need to delete the information other than letters.
You can define the cleanup() function, use a regular expression to delete non-letter
characters, use the lower() function to convert uppercase letters into lowercase ones, and
delete spaces, including '\n', '\r', '\t', and ' '. After apply() is used, the reviews.text
attribute is saved as the summary_clean column.
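The body of cleanup() is not shown in this guide. One possible implementation is sketched below on two invented reviews. The exact regular expression is an assumption, and whitespace is collapsed to single spaces so that the words stay separated.

```python
import re

import pandas as pd


def cleanup(sentence):
    """Keep letters only, lowercase them, and normalize whitespace."""
    sentence = str(sentence)
    sentence = re.sub(r"[^a-zA-Z]", " ", sentence)  # drop non-letters
    sentence = sentence.lower()
    # split() without arguments breaks on \n, \r, \t, and spaces alike.
    return " ".join(sentence.split())


reviews = pd.DataFrame(
    {"reviews.text": ["Love it!!\nBest $99 tablet.", "  So-so\t"]}
)
reviews["summary_clean"] = reviews["reviews.text"].apply(cleanup)
print(reviews["summary_clean"].tolist())
```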
Step 3 Generate a training set and a test set.
Obtain ["Summary_Clean","senti"] from the senti dataset and save it as the split dataset.
Output:
Use 80% of data in split as the training set through split.sample(), remove the data that
has been used in the training set train from split through drop(), and use the remaining
data as the test set test.
Output:
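The 80/20 split described in this step can be sketched on an invented ten-row table. The column names follow the text; the random_state value is arbitrary and only fixes the draw.

```python
import pandas as pd

# Invented stand-in for the split dataset.
split = pd.DataFrame({
    "Summary_Clean": [f"review {i}" for i in range(10)],
    "senti": ["pos"] * 7 + ["neg"] * 3,
})

# 80% of the rows, drawn without replacement.
train = split.sample(frac=0.8, random_state=200)

# Whatever is left after dropping the training rows is the test set.
test = split.drop(train.index)
print(len(train), len(test))
```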
7.2.4 Model Training

7.2.4.1 Model Selection
Emotion analysis of customer evaluations is essentially a classification problem, which can
be solved by using a classification model. Practice has proved that a Naive Bayes model
based on all words performs well in solving some problems, while a model using a word
subset performs well in solving other problems. Logistic regression (LR), multinomial NB,
and Bernoulli NB are selected based on comprehensive consideration.
7.2.4.2 Model Calculation and Evaluation

The Naive Bayes classifier usually uses three models: Gaussian model, multinomial model,
and Bernoulli model. The three models respectively correspond to functions GaussianNB(),
MultinomialNB(), and BernoulliNB() in sklearn.
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
 GaussianNB() is used when a feature is a continuous variable.
 MultinomialNB() is used when a feature is a discrete variable.
 BernoulliNB() is used when a feature is a discrete variable and the feature can be
set only to 1 or 0.
The model calculation phase consists of the following steps:
Step 1 Import libraries.
from wordcloud import STOPWORDS

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
Step 2 Build and evaluate a model.
Convert the data in the training set, test set, and verification set into a list and create
indexes.
Set all words in train["words"] to True and add neg or pos to the end of a sentence based
on the scoring criteria.
Use a trained classifier to attach emotion labels to the test set and verification set to predict
whether words in the test set and verification set are positive or negative.
Output:
Output:
The original dataset check does not contain reviews.rating data. As shown in the
preceding figure, whether each word is negative or positive is predicted after the classifier
is created based on the training set.
from sklearn.naive_bayes import MultinomialNB
Use the CountVectorizer class to perform vectorization, invoke the TfidfTransformer class
to perform preprocessing, construct the term frequency-inverse document frequency (TF-IDF)
vector, and calculate the importance of words. The training set, test set, and verification
set are obtained, which are X_train_tfidf, X_test_tfidf, and checktfidf, respectively.
The main idea of TF-IDF is as follows: If a word or phrase has a high term frequency (TF)
in an article but rarely appears in other articles, the word or phrase is considered to
have a good class distinguishing capability. TF-IDF tends to filter out commonly used
words and retain important words.
The CountVectorizer class converts words in the text into a TF matrix, and uses the
fit_transform() function to calculate the number of appearance times of each word. In
general, you can use CountVectorizer to extract features and then use TfidfTransformer to
calculate the weight of each feature.
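The feature extraction and weighting described above can be sketched on a handful of invented reviews. The texts and labels are hypothetical, and a MultinomialNB model is attached only to show how the resulting matrices are used.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

train_text = ["love this tablet", "great screen love it",
              "terrible battery", "broke after a week terrible"]
train_label = ["pos", "pos", "neg", "neg"]
test_text = ["love the screen", "terrible tablet broke"]

# Word counts per document; the vocabulary is learned from the training text.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_text)

# Reweight the counts by inverse document frequency.
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Apply the same vocabulary and weights to the test text.
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(test_text))

model = MultinomialNB().fit(X_train_tfidf, train_label)
predicted = model.predict(X_test_tfidf)
print(predicted)
```

The test text is passed through transform() rather than fit_transform(), so that words unseen in training are ignored and the training weights are reused.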
Retain important words based on tfidf_transformer to construct the MultinomialNB model.
Output:
Retain important words based on tfidf_transformer to construct the BernoulliNB model.
from sklearn.naive_bayes import BernoulliNB
Output:
Retain important words based on tfidf_transformer to construct the LR model.
from sklearn import linear_model
Output:
In comparison, the LR model has higher accuracy than the other two models.
Step 3 Verify the model.
Select the LR model for verification.

Output:
The classifier accurately provides the positive probability and negative probability of each
sentence.
Step 4 Build a word cloud.
from wordcloud import WordCloud, STOPWORDS
Output:
----End
8 Boston Housing Price Forecast
8.1 Introduction
This experiment uses a small dataset: the open-source Boston housing price data provided
by scikit-learn. The Boston housing price forecast project builds a simple regression
model. By working through it, you can understand the basic usage and data processing
methods of the machine learning library sklearn.
8.1.2 Objectives
• Use the publicly available Boston housing price dataset as the model input data.
• Build, train, and evaluate machine learning models.
• Understand the overall process of building a machine learning model.
• Master the application of machine learning model training, grid search, and
evaluation indicators.
• Master the application of related APIs.
8.1.3 Experiment Dataset and Framework

This experiment is based on the Boston housing price dataset, which contains 506 samples
with 13 features. Each data record contains detailed information about the house and its
surroundings. To be specific, the dataset includes the following features:
• CRIM: per capita crime rate by town
• ZN: proportion of residential land zoned for lots over 25,000 sq.ft
• INDUS: proportion of non-retail business acres per town
• CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• NOX: nitric oxides concentration (parts per 10 million)
• RM: average number of rooms per dwelling
• AGE: proportion of owner-occupied units built prior to 1940
• DIS: weighted distances to five Boston employment centers
• RAD: index of accessibility to radial highways
• TAX: full-value property-tax rate per $10,000
• PTRATIO: pupil-teacher ratio by town
• B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT: % lower status of the population
The target is the median value of owner-occupied homes, in units of $1000.
The sklearn framework provides the Boston housing price data and functions such as
dataset splitting, standardization, and evaluation, and integrates various common
machine learning algorithms. In addition, XGBoost, an optimized implementation of the
gradient boosted decision tree (GBDT), is used as an ensemble algorithm.
8.2 Procedure
8.2.1 Introducing the Dependency
Code:
#Prevent unnecessary warnings.
import warnings
warnings.filterwarnings("ignore")
#Introduce the basic packages of data science.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
##Set attributes to prevent garbled characters in Chinese.
#Introduce machine learning, preprocessing, model selection, and evaluation indicators.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
#Import the Boston dataset used this time.
#Note: load_boston was removed in scikit-learn 1.2, so this import requires an earlier version.
from sklearn.datasets import load_boston
#Introduce algorithms.
from sklearn.linear_model import RidgeCV, LassoCV, LinearRegression, ElasticNet
#SVR is the regression form of SVM (SVC is the classification form).
from sklearn.svm import SVR
#Ensemble algorithms.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
8.2.2 Loading the Dataset, Viewing Data Attributes, and Visualizing Data
Step 1 Load the Boston housing price dataset and display related attributes.
Code:
#Load the Boston house price dataset.
#x features, and y labels.
#Display related attributes.
Output:
Feature column names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B'
'LSTAT'], sample quantity: 506, feature quantity: 13, target sample quantity: 506
Step 2 Convert the data into the data frame format.
Code:
Output:
Figure 8-1 Information about the first five samples
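Steps 1 and 2 can be sketched as follows. Because load_boston was removed in scikit-learn 1.2, this sketch assumes a synthetic stand-in built with make_regression that has the same shape (506 samples, 13 features). Its values are not real housing data; only the column names come from the lab, and MEDV is the conventional name for the target column.

```python
import pandas as pd
from sklearn.datasets import make_regression

# With scikit-learn < 1.2 the lab's data would be loaded roughly like this:
#   boston = load_boston(); x, y = boston.data, boston.target
# Here a synthetic stand-in of the same shape is used instead (assumption).
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
x, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Display related attributes.
print("Feature column names:", feature_names)
print("sample quantity:", x.shape[0], "feature quantity:", x.shape[1])
print("target sample quantity:", y.shape[0])

# Convert the features to a DataFrame and append the target as a column.
df = pd.DataFrame(x, columns=feature_names)
df["MEDV"] = y
print(df.head())  # first five samples
```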
Step 3 Visualize the label distribution.
Code:
Output:
Figure 8-2 Target data distribution
----End
8.2.3 Splitting and Preprocessing the Dataset

Code:
#Segment the data.
#Standardize the dataset.
Output:
array([[-0.35451414, -0.49503678, -0.15692398, ..., -0.01188637, 0.42050162, -0.29153411],
[-0.38886418, -0.49503678, -0.02431196, ..., 0.35398749, 0.37314392, -0.97290358],
[0.50315442, -0.49503678, 1.03804143, ..., 0.81132983, 0.4391143, 1.18523567],
...,
[-0.34444751, -0.49503678, -0.15692398, ..., -0.01188637, 0.4391143, -1.11086682],
[-0.39513036, 2.80452783, -0.87827504, ..., 0.35398749, 0.4391143, -1.28120919],
[-0.38081287, 0.41234349, -0.74566303, ..., 0.30825326, 0.19472652, -0.40978832]])
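The split-then-standardize step can be sketched as follows. The data is a synthetic stand-in of the same shape as the Boston dataset, and test_size and random_state are assumed values that the guide does not state.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the Boston dataset's shape (assumption).
x, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Segment the data. test_size and random_state are assumed values.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=28
)

# Standardize the dataset: fit the scaler on the training split only,
# then apply the same mean and variance to the test split.
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)

print(x_train.shape, x_test.shape)
print(np.round(x_train.mean(axis=0), 6))  # approximately zero per feature
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into the model.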
8.2.4 Performing Modeling on the Dataset by Using Various Regression Models
Code:
#Set the model name.
#Define the model.

# cv sets the cross-validation strategy (an integer gives the number of folds).
# Output the R2 scores of all regression models.
#Define the R2 scoring function.
#Traverse all models to score.
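The loop outlined by the comments above might look like the sketch below. It runs on a synthetic stand-in dataset, the hyperparameter values are assumptions, and XGBRegressor is left out so that scikit-learn is the only dependency. The scores therefore differ from the lab's.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in with the Boston dataset's shape (assumption).
x, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=28
)
ss = StandardScaler()
x_train, x_test = ss.fit_transform(x_train), ss.transform(x_test)

# Set the model names and define the models (hyperparameters are assumed).
names = ["LinearRegression", "Ridge", "Lasso", "ElasticNet",
         "SVR", "RandomForest", "GBDT"]
models = [
    LinearRegression(),
    RidgeCV(alphas=[0.001, 0.1, 1.0], cv=3),   # cv: number of folds
    LassoCV(alphas=[0.001, 0.1, 1.0], cv=5),
    ElasticNet(alpha=0.001, max_iter=10000),
    SVR(),
    RandomForestRegressor(n_estimators=10, random_state=0),
    GradientBoostingRegressor(n_estimators=30, random_state=0),
]

# Define the R2 scoring function: fit on the training split, score on the test split.
def r2(model, x_train, x_test, y_train, y_test):
    model.fit(x_train, y_train)
    return r2_score(y_test, model.predict(x_test))

# Traverse all models to score.
scores = {}
for name, model in zip(names, models):
    scores[name] = r2(model, x_train, x_test, y_train, y_test)
    print(f"{name}: R2 = {scores[name]:.4f}")
```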
Output:
8.2.5 Adjusting Grid Search Hyperparameters

Step 1 Build a model.
Code:
'''
'kernel': kernel function
'C': SVR regularization factor
'gamma': kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels, which affects the model performance
'''
#Use grid search and perform cross validation.
Output:
Step 2 Obtain the optimal parameters.
Code:
Output:
Optimal parameter list: {'C': 5, 'gamma': 0.1, 'kernel': 'rbf'}

Optimal model: SVR(C=5, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.1,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
Optimal R2 value: 0.797481706635164
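A grid search over the three SVR parameters can be sketched as follows. The candidate values in the grid and the number of folds are assumptions, and the data is a synthetic stand-in, so the printed optimum will not match the lab output above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in with the Boston dataset's shape (assumption).
x, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=28
)
ss = StandardScaler()
x_train, x_test = ss.fit_transform(x_train), ss.transform(x_test)

# Candidate values are assumed; every combination is tried.
parameters = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 5],
    "gamma": [0.01, 0.1, 1],
}

# Use grid search and perform cross validation (3 folds assumed).
model = GridSearchCV(SVR(), param_grid=parameters, cv=3)
model.fit(x_train, y_train)

print("Optimal parameter list:", model.best_params_)
print("Optimal model:", model.best_estimator_)
print("Optimal R2 value:", model.best_score_)  # mean cross-validated R2
```

After fitting, model.predict() uses the best estimator refitted on the whole training split, which is what a later visualization of predicted against true values would plot.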
Step 3 Visualize the output.
Code:
##Perform visualization.
#Set the canvas.
#Draw with a red solid line.
#Draw with a green solid line.
#Display in a diagram.
Output:
Figure 8-3 Visualized result
----End
9 E-commerce Website User Group Analysis
9.1 Introduction
This experiment builds a k-means model on a virtual dataset generated by sklearn to
obtain user categories. It is a clustering experiment that shows how to select the
optimal k value and how to observe the clustering effect visually.
9.2 Experiment Code

9.2.1 Using sklearn for Modeling
Step 1 Import libraries.
import numpy as np
Step 2 Create a dataset.
Create virtual data for the algorithm model.
from sklearn.datasets import make_blobs
The make_blobs tool built into sklearn creates the virtual data, drawing the points of
each cluster from a normal (Gaussian) distribution. Parameter settings are as follows:
• n_samples: set to 2000, indicating that 2000 sample points are generated.
• centers: set to 2, indicating that the data actually has two centers.
• n_features: set to 2, indicating the number of features.
For ease of illustration in the coordinate system, only two features are used.
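With the parameter settings listed above, the data generation call can be sketched like this (the variable names X and y are illustrative):

```python
from sklearn.datasets import make_blobs

# 2000 points, two true centers, two features per point.
X, y = make_blobs(n_samples=2000, centers=2, n_features=2)

print(X.shape)               # one row per sample, one column per feature
print(sorted(set(y.tolist())))  # the true center index of each point
```

X holds the coordinates and y holds the index of the center each point was drawn from. The k-means step does not use y.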
Step 3 Apply the k-means algorithm.
from sklearn.cluster import KMeans

n_clusters=5 requests five data clusters, although the data actually contains only two
categories.
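A minimal sketch of this step follows. The random_state and n_init values are assumptions added so that the run is repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Data with two true centers (random_state is assumed, for repeatability).
X, _ = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=3)

# Deliberately ask for five clusters even though only two exist.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
y_pred = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)   # one center per requested cluster
print(sorted(set(y_pred.tolist())))    # cluster labels assigned to the points
```

k-means always returns the number of clusters it is asked for, so each true group is split into several smaller clusters here.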
Step 4 Visualize the output.
Output:
Different data is generated on each run, so the output diagram may differ from that in
the lab. To generate the same data every time, add the random_state parameter during
data generation.
In this example, random_state is set to 3, so the same data is generated on every
run.
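The effect of random_state can be checked directly, as in this short sketch:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Two independent calls with the same seed.
X_a, y_a = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=3)
X_b, y_b = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=3)

# With a fixed random_state the generated points and labels are identical.
same = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
print("identical data:", same)
```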
Step 5 Use more features for comparison.
In this example, ten features are used to generate data, random_state is set to 30, and
there are three categories in theory.
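The ten-feature variant can be sketched as follows. The sample count of 2000 and the KMeans settings are assumptions carried over from the earlier steps.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Ten features, three true centers, random_state=30 as in the text.
# n_samples=2000 is an assumption.
X, y_true = make_blobs(n_samples=2000, centers=3, n_features=10, random_state=30)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = kmeans.fit_predict(X)

# A 2-D scatter plot can only show two of the ten features at a time,
# so agreement with the true grouping is measured numerically instead.
ari = adjusted_rand_score(y_true, y_pred)
print(X.shape, "adjusted Rand index:", round(ari, 3))
```

The adjusted Rand index compares the predicted clusters with the true grouping without depending on how the cluster labels are numbered.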
----End
9.2.2 Selecting the Optimal k Value

In the preceding steps, the k value is manually set. In actual environments, the number of
centers is unknown. Therefore, you need to find the optimal k value.
import random
First, generate two random numbers ranging from 1 to 30, so that the number of true
centers in the data is unknown and the number of features is random.
Then, perform k-means clustering in a loop over candidate k values. The .inertia_
attribute returns the sum of squared distances from each sample to its nearest cluster center.
Visualize the result by using a visualization tool.
The result varies each time due to the impact of the random numbers. As shown in the
preceding figure, the turning point (elbow) appears at the position corresponding to the
value 21. Therefore, 21 is the optimal k value for that run.
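The search for the turning point can be sketched as a loop over candidate k values. The seed, the sample count, the range of k, and the reading of the two random numbers as "number of centers" and "number of features" are all assumptions. Plotting is omitted, and the inertia values are printed instead.

```python
import random

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

random.seed(0)  # assumed seed so the sketch is repeatable; the lab leaves it unseeded
n_centers = random.randint(1, 30)    # true number of centers, treated as unknown
n_features = random.randint(1, 30)   # random number of features

X, _ = make_blobs(n_samples=2000, centers=n_centers,
                  n_features=n_features, random_state=0)

# Fit k-means for each candidate k and record the inertia
# (sum of squared distances from each sample to its nearest center).
k_values = list(range(1, 31))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting inertias against k_values gives the elbow curve: inertia drops steeply until k reaches the true number of centers and flattens afterwards.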
