AU2019101198A4

AU2019101198A4 - A statistical analysis method of mobile telecom data driven user loss prediction

Info

Publication number: AU2019101198A4
Application number: AU2019101198A
Authority: AU
Inventors: Jingchao Gu; Tiansheng Jin; Yarong Li; Xinyue ZHANG
Original assignee: Li Yarong Miss; Zhang Xinyue Miss
Current assignee: Li Yarong Miss; Zhang Xinyue Miss
Priority date: 2019-10-03
Filing date: 2019-10-03
Publication date: 2020-01-16
Anticipated expiration: 2027-10-03

Abstract

This invention applies to the internet and mobile telecom industry. It extracts data from China Telecom covering users' basic information, mobile data service bundle information, broadband network service information, online behavior in the last three months, DPI search keywords, and users' suspension and cancellation status, and uses big data analysis to study the relationship between this information and broadband customer loss. The experience gained is summarized into a feature model that provides a strong theoretical and policy basis for the operator's data updating and for applying the model to future user loss prediction.

Description

TITLE

A statistical analysis method of mobile telecom data driven user loss prediction

FIELD OF THE INVENTION

This invention is in the field of the internet and telecom industry and serves as a method for analyzing internet and mobile data users and predicting telecom broadband user loss.

BACKGROUND

Competition for market share in the internet telecom industry is becoming increasingly fierce, globally and especially in China. In China, there are three major mobile operators: China Mobile, China Telecom, and China Unicom. In this competition, China Telecom focuses on its share of the broadband network market, which went through the internet boom that took place in China a few years ago. Now that market growth has stabilized, customer loss is becoming an issue for all three operators.

Under such circumstances, there is a pressing need to summarize common problems and modeling experience in order to give internet telecom operators ideas for predicting high-risk broadband users. This is the age of big data, but the database analysis methods used by internet and telecom operators are still relatively backward and cannot keep up with the rapidly growing demand for processing tremendous amounts of data.

Therefore, this invention extracts the data characteristics of telecom broadband user loss from China Telecom. This invention will provide an analysis method and a powerful theoretical and policy basis for China Telecom to update big data in the broadband market competition.

SUMMARY

This invention uses China Telecom user data from the last three months as the object of study, as shown in Fig. 1. The data is divided into six groups; the sixth group holds the users’ suspension and cancellation information and is used for user churn rate analysis by comparing it with each of the other five data sets separately. We then describe each field to better understand the data’s characteristics, the relationships among fields, and their relevance to the user churn rate analysis.

Based on the brief understanding of all the data gained in the first step, we process the data, filling in missing values and handling outliers so that the data is easier to read and load in Python. This invention refers to this procedure as Data Cleaning, which includes filling in and eliminating missing data, data normalization, conversion processing, data re-coding, data segmentation and data consolidation.

Then, Pearson correlation coefficient analysis is applied separately to suspended users, cancelled users and lost users to identify the most influential characteristics.

Finally, this invention uses a Bagging combined algorithm to find the best combination of classification methods among GaussianNB, LogisticRegression, DecisionTreeClassifier, ExtraTreesClassifier, AdaBoostClassifier, MLPClassifier and RandomForestClassifier, and then optimizes the resulting model.

DESCRIPTION OF DRAWING

Fig. 1 is the invention flow chart

Fig. 2 shows the overview of users’ basic data set

Fig. 3 shows the description of inner_date

Fig. 4 shows the normalizing of inner_date

Fig. 5 shows the Pearson correlation coefficient analysis of users’ basic data set

Fig. 6 shows the Spearman's rank correlation coefficient analysis of users’ basic data set

Fig. 7 shows the relationship between user churn rate and user age

Fig. 8 shows the description of is_active_user

Fig. 9 shows the description of send_amount

Fig. 10 shows the description of store_balance

Fig. 11 shows the data segmentation

Fig. 12 shows the Pearson correlation coefficient analysis between certain selected fields (1) (suspended users)

Fig. 13 shows the Pearson correlation coefficient analysis between certain selected fields (2) (suspended users)

Fig. 14 shows the Pearson correlation coefficient analysis between certain selected fields (3) (suspended users)

Fig. 15 shows the Pearson correlation coefficient between different industries (suspended users)

Fig. 16 shows the Pearson correlation coefficient between different regions (suspended users)

Fig. 17 shows the Pearson correlation coefficient analysis about star level and type of network (suspended users)

Fig. 18 shows the Pearson correlation coefficient between different development channels (suspended users)

Fig. 19 shows the Pearson correlation coefficient between various Web Service Bundles (suspended users)

Fig. 20 shows the Pearson correlation coefficient analysis between certain selected fields (1) (cancelled users)

Fig. 21 shows the Pearson correlation coefficient analysis between certain selected fields (2) (cancelled users)

Fig. 22 shows the Pearson correlation coefficient analysis between certain selected fields (3) (cancelled users)

Fig. 23 shows the Pearson correlation coefficient between different industries (cancelled users)

Fig. 24 shows the Pearson correlation coefficient between different regions (cancelled users)

Fig. 25 shows the Pearson correlation coefficient analysis about star level and type of network (cancelled users)

Fig. 26 shows the Pearson correlation coefficient between different development channels (cancelled users)

Fig. 27 shows the Pearson correlation coefficient between various Web Service Bundles (cancelled users)

DESCRIPTION OF PREFERRED EMBODIMENT

Data Understanding

Six data sets containing user information for the last three months were obtained from China Telecom:

1) Users’ basic data, abbreviated as user_base_data_a;

2) Users’ DPI data, abbreviated as user_dpi_data_a;

3) Users’ service bundle information, abbreviated as user_mix_data_a;

4) Users’ online behavior in the last three months, abbreviated as user_net_flux_data_a;

5) Users’ Broadband Networks data, abbreviated as user_net_data_a;

6) Suspension and cancellation of mobile service user list, abbreviated as user_state_data_a.

For each data set, we first explain each field briefly and use Python (run in PyCharm) as the analysis tool to describe each field in the set. Counting the number of variables, observations and missing values indicates which data processing methods are needed. The detailed information for each individual field shows that many missing values need to be filled in. Re-coding and normalization are widely used across all of these data sets. We then use the Pearson correlation coefficient and Spearman's rank correlation coefficient to illustrate the relationship between each pair of fields. In the last part of this procedure, we combine each data set with the suspension and cancellation user list to compare user churn rates and to draw conclusions for user churn prediction.
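The counting step described above can be sketched with pandas. This is only an illustrative sketch: the `profile` helper and the tiny sample frame are assumptions made for the example, not the program used in the invention, and the values are made up.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each field: type, missing count, missing rate, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "missing_rate": df.isna().mean(),
        "n_unique": df.nunique(),
    })


# Made-up sample standing in for user_base_data_a (illustrative values only).
user_base = pd.DataFrame({
    "USER_ID": [1, 2, 3, 4],
    "AGE": [25, None, 41, 58],
    "SEX": ["F", "M", None, "M"],
})

# Number of observations (rows) and variables (columns).
n_observations, n_variables = user_base.shape
report = profile(user_base)
print(n_variables, n_observations)
print(report)
```

Fields with a high `missing_rate` in such a report are the candidates for filling in or elimination during data cleaning.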

(1) Users’ Basic Data Set

The users’ basic data set contains the user’s general information and the service bundle’s basic information.

Table. 1 Description of the fields in users’ basic data set

Field Name	Field Meaning
USER_ID	User ID
LATN_ID	City
AGE	User’s age
SEX	User’s gender
STARLEVEL	User’s star-rating level
CUSTTYPE	Client type
CORP_NAME	City-level branch
SUBST_NAME	County-level branch
BRANCH_NAME	Service fulfillment business hall
DEVELOP_CHANNEL_NEW	Agency (Channel Development)
WLLX	Housing type
HANGYTYPE	User’s career industry type
BASEOFFERRH	Service bundle type
PAYTYPE	Payment type
INNER_DATE	User’s account time
ACCESSTYPE	Network line access mode

Table. 1 explains the practical meaning of every field name in this set.

Fig. 2 shows an overview of the program output for this data set.

Taking the field inner_date as an example, Fig. 3 describes how long each existing user has been with the company’s service, based on the data collected from the last three months.

The data of this field is concentrated in certain categories, so attention must be paid to the fitting problem. The field can be converted to an integer to improve processing speed. The maximum value of this field is 1406, which would be about 117 years if counted in months, so it can be considered an outlier.

Fig. 4 shows the normalization of the inner_date field. The single outlier, 1406, should be discarded.
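A minimal sketch of this treatment of inner_date is given below, assuming the field is stored in months. The sample values and the `MAX_MONTHS` cap are hypothetical choices for illustration; the source only states that the value 1406 is discarded.

```python
import pandas as pd

# Made-up inner_date values in months; 1406 mirrors the outlier noted in the text.
inner_date = pd.Series([3.0, 12.0, 48.0, 1406.0, 120.0], name="inner_date")

# Convert to integer to speed up later processing.
inner_date = inner_date.astype("int64")

# 1406 months is roughly 117 years, which is not a plausible account age.
years = 1406 / 12
print(round(years, 1))

# Hypothetical plausibility cap: treat anything above 50 years as an outlier.
MAX_MONTHS = 50 * 12
cleaned = inner_date[inner_date <= MAX_MONTHS]
print(cleaned.tolist())
```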

Fig. 5 and Fig. 6 show the Pearson correlation coefficient analysis and Spearman's rank correlation coefficient analysis among the fields in the first data set. From these two coefficients, we can draw the following conclusions:

1) CORPID and LATN_ID are strongly correlated, which means that “City-level branch” and “City” are strongly correlated. Because “City-level branch” has more missing data, we eliminate this field;

2) AGE and INNER_DATE are correlated: the older the user, the longer the service time;

3) DEVELOP_CHANNEL_NEW and AGE are negatively correlated;

4) DEVELOP_CHANNEL_NEW and INNER_DATE are strongly negatively correlated.
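The two correlation analyses above can be reproduced in outline with pandas, whose `DataFrame.corr` supports both methods. The sample values below are made up so that the signs match the conclusions in the text; they are not taken from the China Telecom data.

```python
import pandas as pd

# Made-up sample: older users have longer service time and lower channel codes.
df = pd.DataFrame({
    "AGE": [22, 30, 41, 55, 63],
    "INNER_DATE": [5, 20, 38, 70, 96],
    "DEVELOP_CHANNEL_NEW": [9, 7, 6, 3, 1],
})

# Pearson measures linear association; Spearman measures monotonic (rank) association.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

Comparing the two matrices helps separate relationships that are linear from those that are merely monotonic.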

Fig. 7 shows the relationship between user churn rate and user age. We can conclude that the younger the user, the higher the rate of terminating the service. Users aged 40-60 are more stable than customers under 40, which makes them the main targets for the company to reach out to.
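One way to compute a churn rate per age group is to flag the users that appear in the suspension and cancellation list and average that flag within each age band. The sketch below assumes this reading; the sample data and the band edges are illustrative only.

```python
import pandas as pd

# Made-up basic data and suspension/cancellation list.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "AGE": [22, 27, 35, 45, 52, 58],
})
state = pd.DataFrame({"user_id": [1, 2, 4]})

# A user counts as churned if the user ID appears in the state list.
users["churned"] = users["user_id"].isin(state["user_id"])

# Left-closed bands: [0, 40), [40, 61), [61, 200).
users["age_group"] = pd.cut(
    users["AGE"],
    bins=[0, 40, 61, 200],
    right=False,
    labels=["under 40", "40-60", "over 60"],
)

# Mean of a boolean column is the share of churned users in each band.
churn_rate = users.groupby("age_group", observed=True)["churned"].mean()
print(churn_rate)
```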

The following five data sets are described using the same method.

Table. 2 to Table. 6 describe every field of the remaining five data sets:

(2) Users’ Service Bundle Information Set

Table. 2 Description of the fields in users’ service bundle information set

Field Name	Field Meaning
USER_ID	User ID
KD_MIX_TYPE	Broadband service bundle type
OFFERNAME	Service bundle name
XYQLIMITDATE	Expiration month of yearly package
YCOFFERNAME	Deposit package name
YCLIMITDATE	Expiration date of the deposit package
LIMITDATENEW	Discount expiration date
YW_SPEED_VALUE	Package network speed
TY_SPEED_VALUE	Experienced network speed
ZD_SPEED_VALUE	Maximum accessible broadband network speed
UPSPEEDMONTH	Month of the last internet speed upgrade
ISTZG	Whether the user is eligible for replacing copper wire with optical fiber
ISMOREKD	Whether the user is eligible for an internet speed upgrade
ISATTENSPEED	Whether the user’s main concern is network speed
UPSPEEDHANGY	Whether the user is in an industry with potential demand for an internet speed upgrade
ISVISTOTHERURL	Whether the user ever visited a competitor's web page
ISPHSTK	Whether the user opened this account with deposit
IS_ACTIVE_USER	Active user or not
STOREBALANCE	Deposit principal amount
SEND_AMOUNT	Upload data consumption

(3) Users’ Broadband Networks Data Set

Table. 3 Description of the fields in users’ broadband networks data set

Field Name	Field Meaning
USER_ID	User ID
NEW_MIX_TYPE	Service bundle type
CUST_MOBILE_NUMS	number of mobile phones owned by the user
CUST_GH_NUMS	number of fixed phones owned by the user
CUST_ITV_NUMS	number of iTV owned by the user
ZK-ARPU	mobile revenue
BRANDTYPE	Mobile telecom provider
BRAND	Mobile phone brand name
ZKOFFERNAME	Mobile data service bundle type
AMOUNT_09	Mobile phone data

(4) Users’ Online Behavior in the Last Three Months

Table. 4 Description of the fields in the set of users’ online behavior in the last three months

Field Name	Field Meaning
USER_ID	User ID
CUSTARPU	General income in last month
ACCOUNTARPU	Broadband revenues in last month
NET_INNER_DUR	Total online time in last month
NET_TIMES	Gsesp in last month
NET_FLUX	Online data consumption in last month
RECV_FLUX	Downstream in last month
SEND_FLUX	Uploading in last month
CUST_ARPU_PRE2	General income in the month before last
ACCOUNTARPUPRE2	Broadband revenues in the month before last
NET_INNER_DUR_PRE2	Total online time in the month before last
NET_TIMES_PRE2	Gsesp in the month before last
NET_FLUX_PRE2	Online data consumption in the month before last
RECV_FLUX_PRE2	Downstream in the month before last
SEND_FLUX_PRE2	Uploading in the month before last
CUST_ARPU_PRE3	General income in the month two months ago
ACCOUNTARPUPRE3	Broadband revenues in the month two months ago
NET_INNER_DUR_PRE3	Total online time in the month two months ago
NET_TIMES_PRE3	Gsesp in the month two months ago
NET_FLUX_PRE3	Online data consumption in the month two months ago
RECV_FLUX_PRE3	Downstream in the month two months ago
SEND_FLUX_PRE3	Uploading in the month two months ago

(5) Users’ DPI Data Set

Table. 5 Description of the fields in users’ DPI data set

Field Name	Field Meaning
user_id	User ID
keywords	Online searching keyword
keywords_search_cnt_work	Number of searching times on weekdays
keywords_visit_date_work	Number of online days on weekdays
keywords_search_cnt_holid	Number of searching times on holidays
keywords_visit_date_holid	Number of online days on holidays
is_holiday	Whether the keyword is the holiday

(6) Suspension and Cancellation of Mobile Service User List

Table. 6 Description of the fields in the set of suspension and cancellation of mobile service user list

Field Name	Field Meaning
user_id	User ID
State	Suspension and cancellation status

Data Cleaning

After all the required data has been collected and described, the next step is analysis and selection. This step plays a crucial role in the whole project since it determines the accuracy of the final results.

The collected data contains a series of disturbing factors; it can only be analyzed after cleaning, that is, after eliminating outliers and any unreasonable values.

The exact processing procedure and methods are as follows:

(1) Missing value:

Solution:

1) Missing value treatment: If the proportion of missing values is small, a random forest model is used to fill in the missing values.

2) Elimination treatment: If the proportion of missing values exceeds 50%, the field is discarded.
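The two treatments above can be sketched as follows. This is an assumption-laden illustration: the helper names, the column `sparse_field`, the sample values, and the choice of `RandomForestRegressor` for a numeric field are all hypothetical, since the source does not give implementation details.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Discard fields whose missing rate exceeds the threshold."""
    return df.loc[:, df.isna().mean() <= threshold]


def fill_with_random_forest(df: pd.DataFrame, target: str, predictors: list) -> pd.DataFrame:
    """Fill missing values of `target` using a model trained on the complete rows."""
    df = df.copy()
    known = df[df[target].notna()]
    unknown = df[df[target].isna()]
    if unknown.empty:
        return df
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(known[predictors], known[target])
    df.loc[unknown.index, target] = model.predict(unknown[predictors])
    return df


# Made-up sample; sparse_field is about 67% missing and should be discarded.
data = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 35],
    "inner_date": [6, 24, 60, 90, 130, 40],
    "store_balance": [100.0, 200.0, np.nan, 400.0, 500.0, 250.0],
    "sparse_field": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0],
})

data = drop_sparse_columns(data)
data = fill_with_random_forest(data, "store_balance", ["age", "inner_date"])
print(data)
```

For a categorical field, a classifier would take the place of the regressor in the same pattern.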

#note: Missing values, outliers and unknown data should all be classified as missing values.

(2) Processing of continuous value:

For readability, the Min-Max normalization method is adopted here.

Min-Max normalization is a linear transform that maps the result value to [0, 1]. The conversion function is as follows:

$$x' = \frac{x - \min}{\max - \min}$$

where max is the maximum value of the sample data and min is the minimum value of the sample data. The advantage of this method is that the conversion is simple; the drawback is that when new data is added, max and min may change and need to be redefined.
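A short sketch of Min-Max normalization, together with the drawback just mentioned, is shown below. The function name and the sample values are illustrative assumptions; the guard for a constant column is an added convenience not discussed in the source.

```python
import numpy as np


def min_max_normalize(values):
    """Map values linearly to [0, 1] using the sample minimum and maximum."""
    x = np.asarray(values, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)


original = min_max_normalize([10, 20, 30])

# Adding a new, larger observation changes max, so earlier values are rescaled.
extended = min_max_normalize([10, 20, 30, 50])

print(original.tolist())
print(extended.tolist())
```

The second call illustrates why min and max have to be redefined whenever the data is extended.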

(3) Conversion processing:

For example, Fig. 8 demonstrates the conversion of is_active_user into an integer or Boolean type to improve processing speed.

(4) Processing of normalization and discretization:

Fig. 9 shows that the maximum value of the send_amount field is 113590000, so the data of this field should be normalized.

Fig. 10 shows that the maximum value of store_balance is 44974; outlier processing is needed, and normalization is attempted.

(5) Extra examples of data processing:

Table. 7 List of data processing

Processing content	English field name	Missing rate	Data set
Type conversion and missing value	user_id	0.0%	USER_BASE_DATA_A
Type conversion and missing value	Offer_id	1.7%	USER_NET_DATA_A
Type conversion and missing value	Yw_speed_value	1.7%	USER_NET_DATA_A
Take the three-month mean and normalize it	Cust_arpu		USER_NET_FLUX_DATA_A
Take the three-month mean and normalize it	Account_arpu		USER_NET_FLUX_DATA_A
Take the three-month mean and normalize it	Net_inner_dur		USER_NET_FLUX_DATA_A
delete	New_mix_type	49.4%	USER_MIX_DATA_A
delete	Send_flux_Pre2		USER_NET_FLUX_DATA_A
delete	Cust_arpu_pre3		USER_NET_FLUX_DATA_A
eliminate	Is_phs_tk	0.0%	USER_NET_DATA_A
eliminate	Is_holiday		USER_DPI_DATA_A
eliminate	Corpid	12.5%	USER_BASE_DATA_A
normalization	Store_balance	0.0%	USER_NET_DATA_A
normalization	Send_amount	1.7%	USER_NET_DATA_A
code	Access type	0.0%	USER_BASE_DATA_A
code	wllx	0.0%	USER_BASE_DATA_A

Data coding:

Discrete data is treated with one-hot coding, using a specific lookup table for each field.

Data segmentation:

Because of the large amount of data, slicing by variable is used in practice: each single variable generates a file with an index, the files are processed separately and then merged, somewhat like MapReduce, as shown in Fig. 11.
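The coding and segmentation steps above can be sketched together, assuming the coding step means one-hot encoding and that each variable is processed as its own slice keyed by user ID before merging. The `encode_column` helper and the sample values are hypothetical; in-memory frames stand in for the per-variable files.

```python
import pandas as pd

# Made-up discrete fields, indexed by user ID so slices can be merged later.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "wllx": [0, 2, 1],
    "kd_mix_type": [1, 0, 1],
}).set_index("user_id")


def encode_column(col: pd.Series) -> pd.DataFrame:
    """One-hot encode a single discrete field, keeping the user index."""
    return pd.get_dummies(col, prefix=col.name, prefix_sep="", dtype=int)


# "Map" step: process each variable separately, one indexed slice per variable.
pieces = [encode_column(df[name]) for name in ["wllx", "kd_mix_type"]]

# "Reduce" step: merge the processed slices back together on the shared index.
merged = pd.concat(pieces, axis=1)
print(merged)
```

With `prefix_sep=""` the generated columns are named in the style used later in the text, such as wllx0 and kd_mix_type1.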

Data consolidation

For the cleaned data, three files are formed: train_1.txt, the suspension list; train_2.txt, the cancellation list; and train_ls.txt, the loss list (i.e., users who meet either the suspension or the cancellation condition).

Feature engineering

Analysis of service suspending user characteristic

Fig. 12 makes it clear that the correlation between these variables is quite weak. Compared with the rest of the data, the correlation between inner_date and age_MinMax is relatively strong.

As in the previous chart, the correlations shown in Fig. 13 are not strong either.

Nevertheless, the relationship between income and the number of the customer’s mobile phones is extremely strong.

In Fig. 14, the correlation between ‘search ratio on weekdays’ and ‘average number of searches per day’ is strong.

In addition, ‘total number of searches’ and ‘average number of searches per day’ show a moderate correlation.

Fig. 15 shows the relationships between different industries.

Among these industries, ‘industry 1’ and ‘industry 0’ show a strong negative correlation. The correlations between the remaining industries are so weak that they can be ignored.

Fig. 16 shows the Pearson correlation between different regions.

The correlations between regions are all negative, with values so small that they can be ignored, except for a weak correlation between ‘region 17’ and ‘region 6’.

Fig. 17 concerns star level and type of network: ‘type 5’ and ‘star_level 7’ show a weak correlation, and ‘type 0’ and ‘type 1’ show a negative correlation.

Fig. 18 shows the relationship between different development channels.

The correlation between most development channels can be ignored; however, between development channel 0 and development channels 11/9/8/6/5/3/1, the correlation is relatively strong.

Fig. 19 shows the correlations between various Web Service Bundles.

There is a weak correlation between ‘type 2’ and ‘type 0’ and a strong negative correlation between ‘type 1’ and ‘type 0’.

Overall, the features of service suspending customers have some internal relationships, even though service suspension is not highly correlated with any single variable. Among service suspending customers, the following pairs show a positive correlation: age / inner_date, search ratio on weekdays / average searches per day, number of searches / average searches per day, star level 7 / type of network 5, net_flux_avg_MinMax / net_inner_dur_avg_MinMax, and net_flux_avg_MinMax / s_cnt_MinMax.

The Pearson correlation coefficients are negative and show a strong relationship for the following pairs: customer type / is_visit_other_url, pay_type / store_balance_MinMax, brand_type0 / brand_type1, brand_type2 / brand_type0, hangy_type0 / hangy_type1, latn_id6 / latn_id17, wllx0 / wllx1, wllx2 / wllx0, develop_channel_new0 / develop_channel_new1,3,5,6,8,9,11, kd_mix_type0 / kd_mix_type1, and kd_mix_type0 / kd_mix_type2.

Service cancelling characteristics analysis

In general, the features of service cancelling users are similar to those of service suspending users, with only slight disparities.

Characteristics of lost users

(1) Correlation of features

The feature correlations of lost users are the same as those of service suspending users and service cancelling users.

(2) Selection of characteristics

Random forest, AdaBoost, ExtraTree, GradientBoosting and DecisionTree are used to score the characteristics, in order to obtain the best characteristic parameters and a precise prediction.

The scores given by each classifier are then normalized and averaged to identify the best features: net_inner_dur, account_arpu, cust_arpu, send_amount, inner_date, net_times, net_flux, the amount deposited, and age.

Table. 8 Parameters of the selected characteristics

feature	RF S MM	ADA S MM	ET S MM	GB S MM	DT S MM	MEAN
net_inner_dur_avg_MinMax	0.9350	0.8421	1.0000	0.7525	0.7692	0.8598
account_arpu_avg_MinMax	0.8769	0.0000	0.7779	1.0000	1.0000	0.7310
cust_arpu_avg_MinMax	1.0000	0.0000	0.7601	0.9220	0.8030	0.6970
send_amount_MinMax	0.8451	0.6842	0.4953	0.5298	0.7194	0.6547
inner_date_MinMax	0.8980	0.0526	0.8069	0.6829	0.7347	0.6350
net_times_avg_MinMax	0.8110	0.0526	0.7170	0.7407	0.7392	0.6121
net_flux_avg_MinMax	0.8824	0.0000	0.7444	0.6364	0.5654	0.5657
store_balance_MinMax	0.3732	1.0000	0.3922	0.3386	0.4099	0.5028
age_MinMax	0.7948	0.0000	0.7197	0.5076	0.4730	0.4990
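The MEAN column is the plain average of the five classifier scores; for the first row, (0.9350 + 0.8421 + 1.0000 + 0.7525 + 0.7692) / 5 ≈ 0.8598. A possible way to produce such a table with scikit-learn is sketched below. It assumes the per-classifier scores are feature importances that are Min-Max scaled within each classifier; the source does not state this explicitly, and the synthetic data and feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned user features and the loss label.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

models = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "ADA": AdaBoostClassifier(n_estimators=50, random_state=0),
    "ET": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

scores = pd.DataFrame(index=names)
for label, model in models.items():
    importance = model.fit(X, y).feature_importances_
    # Min-Max scale within each classifier so the columns are comparable.
    scores[label] = (importance - importance.min()) / (importance.max() - importance.min())

# Average across classifiers and rank the features.
scores["MEAN"] = scores.mean(axis=1)
ranking = scores.sort_values("MEAN", ascending=False)
print(ranking.round(4))
```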

Data Modeling

Currently, we use the GaussianNB, LogisticRegression, DecisionTreeClassifier, ExtraTreesClassifier, AdaBoostClassifier, MLPClassifier and RandomForestClassifier classifiers. The next step is to determine which classifier is most suitable for this invention. This invention mainly uses a Bagging combined algorithm to find the best classification method.

We analyze the ROC curve of each classifier and of the Bagging combined algorithm and find that the Extra Trees Classifier has the worst performance, while the combined algorithm reaches AUC=0.78, which needs to be optimized.
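The comparison of individual classifiers against a combined model by AUC can be sketched as follows. The source does not detail how the classifiers are combined, so this sketch substitutes soft voting (averaging predicted probabilities via scikit-learn's `VotingClassifier`) as a stand-in; the synthetic data and the subset of classifiers are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned features and the loss label.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

base = [
    ("nb", GaussianNB()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("ada", AdaBoostClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# AUC of each classifier on its own.
auc = {}
for name, clf in base:
    clf.fit(X_train, y_train)
    auc[name] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Assumed combination rule: average the predicted probabilities (soft voting).
combined = VotingClassifier(estimators=base, voting="soft").fit(X_train, y_train)
auc["combined"] = roc_auc_score(y_test, combined.predict_proba(X_test)[:, 1])

for name, value in sorted(auc.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.3f}")
```

Sorting by AUC makes it easy to spot the weakest members, which is the kind of evidence used later to drop classifiers from the combination.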

Model Optimizing

The model built previously achieves AUC=0.78 and still needs to be optimized. The main optimization ideas are as follows:

1) In terms of features, many features have non-normal distributions, which we plan to treat with a logarithmic transform.

2) In terms of model processing, the model is still underfitting at present, and we plan to add cross validation to find the best parameters.

3) In the combination mode, the Extra Trees Classifier, which has the worst effect, is eliminated.
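The first two ideas can be sketched with NumPy and scikit-learn. This is a minimal illustration under assumptions: `log1p` is used as the logarithmic transform so that zeros are handled, the parameter grid is arbitrary, and the skewed synthetic data only mimics right-skewed usage amounts.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in; exponentiating yields non-negative, right-skewed features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_skewed = np.exp(X)

# Idea 1: logarithmic treatment of non-normally distributed features.
X_log = np.log1p(X_skewed)

# Idea 2: cross validation to search for the best parameters, scored by AUC.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 6]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_log, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```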

Optimization of lost users

After taking the logarithm of the original data, we train the model. The first training of the combined model gave AUC=0.79, and we found that the naive Bayes and logistic classifiers were not very effective. In the second training, these two classifiers were eliminated, and the final combined model reached AUC=0.823.

Optimization for service suspending users

As in the previous process, we take the logarithm of the original data and then train the model. The first training of the combined model gave AUC=0.832, and we found that the naive Bayes and logistic classifiers were not effective. In the second training, the combination of Decision Tree, AdaBoost and Random Forest, which had the highest scores in the previous training, was adopted, and the final AUC was 0.863.

Optimization of service cancelling users

As in the previous process, we take the logarithm of the original data and then train the model. The first training of the combined model gave AUC=0.739, and we found that the naive Bayes, logistic and MLP neural network classifiers had relatively poor effects. The second training used a combination of DecisionTree and AdaBoost, giving a combined model AUC=0.783, only slightly better than the AdaBoost classifier alone (AUC=0.780), so we finally decided to use the AdaBoost classifier.

The Future Plan

The results basically meet the needs. We hope to further optimize our model in the future.

For some features, we plan to work with the business side to study whether other feature indicators can be derived, such as binding expiration time, recent revenue trend, etc.

In the combination mode, the Extra Trees Classifier, which currently has the worst effect, is eliminated; in addition, Stacking integration will be attempted.

In terms of model processing, the model may still be underfitting at present, and we plan to add cross validation to find the best parameters.

Next, we plan to export the model and turn it into a small tool that can be embedded in an existing business app or web page to provide an auxiliary reference for business promotion in the territory.

Claims

1. A statistical analysis method of mobile telecom data driven user loss prediction, wherein millions of records of users’ information and behavior in the last three months are obtained from China Telecom to analyze the relationships between each characteristic and the loss of users; and the resulting model can also be used for future user loss prediction.

2. The method according to claim 1, wherein multiple data collection and processing tools are used, including filling in and eliminating missing data, data normalization, conversion processing, data re-coding, data segmentation and data consolidation.

3. The method according to claim 1, wherein a bagging combined algorithm is mainly used to find the best classification method.

4. The method according to claim 1, wherein an analysis method and a powerful theoretical and policy basis are provided for China Telecom to update big data in the broadband market competition.