AU2019101198A4 - A statistical analysis method of mobile telecom data driven user loss prediction - Google Patents

Info

Publication number
AU2019101198A4
Authority
AU
Australia
Prior art keywords
data
users
user
shows
telecom
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019101198A
Inventor
Jingchao Gu
Tiansheng Jin
Yarong Li
Xinyue ZHANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Li Yarong Miss
Zhang Xinyue Miss
Original Assignee
Li Yarong Miss
Zhang Xinyue Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Li Yarong Miss, Zhang Xinyue Miss
Priority to AU2019101198A
Application granted
Publication of AU2019101198A4
Ceased
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0202Market predictions or forecasting for commercial activities

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Operations Research (AREA)
  • Databases & Information Systems (AREA)
  • Development Economics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Algebra (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Educational Administration (AREA)
  • Medical Informatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Computing Systems (AREA)
  • Marketing (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention applies to the internet and mobile telecom industry. It extracts data from China Telecom covering users' basic information, mobile data service bundle information, broadband network service information, online behavior in the last three months, DPI search keywords, and users' suspension and cancellation status, and it studies the relationship between this information and broadband customer loss using big data analysis methods. We also summarize this experience to extract a feature model that gives the operator a strong theoretical and policy basis for updating and applying its data, and that can be used for future user loss prediction.

Description

TITLE
A statistical analysis method of mobile telecom data driven user loss prediction
FIELD OF THE INVENTION
This invention belongs to the internet and telecom industry and provides a method for analyzing internet and mobile data users and for predicting telecom broadband user loss.
BACKGROUND
Competition for market share in the internet telecom industry is becoming increasingly fierce, globally and especially in China. China has three major mobile operators: China Mobile, China Telecom, and China Unicom. In this competition, China Telecom focuses on its share of the broadband network market, which went through the internet boom that took place in China a few years ago. Now that market growth has levelled off, customer loss has become an issue for all three operators.
Under such circumstances, there is a pressing need to summarize common problems and modeling experience so that internet telecom operators have a basis for predicting high-risk broadband users. This is the age of big data, but the database analysis methods used by internet and telecom operators are still relatively backward and cannot keep up with the rapidly growing demand for processing huge amounts of data.
2019101198 03 Oct 2019
Therefore, this invention extracts the data characteristics of telecom broadband user loss from China Telecom data. It provides an analysis method and a strong theoretical and policy basis for China Telecom to update its use of big data in the broadband market competition.
SUMMARY
This invention studies China Telecom user data from the last three months, as shown in Fig. 1. The data is divided into six groups; the sixth group holds the users’ suspension and cancellation information and is compared with each of the other five data sets to analyze the user churn rate. We then describe each field to better understand the characteristics of the data, the relationships within it, and its effect on the user churn rate analysis.
Based on the overview of all data obtained in the first step, we process the data and treat the outliers so that the data is easier to read and load in Python. This invention calls this procedure Data Cleaning; it includes filling in and eliminating missing data, data normalization, conversion processing, data re-coding, data segmentation, data consolidation, etc.
Then, we apply Pearson correlation coefficient analysis to the suspended users, the cancelled users and the lost users to identify the most influential characteristics.
Finally, this invention uses a Bagging combination algorithm to find the best combination of classification methods among GaussianNB, LogisticRegression, DecisionTreeClassifier, ExtraTreesClassifier, AdaBoostClassifier, MLPClassifier and RandomForestClassifier, and then optimizes the resulting model.
DESCRIPTION OF DRAWING
Fig. 1 is the invention flow chart
Fig. 2 shows the overview of users’ basic data set
Fig. 3 shows the description of inner_date
Fig. 4 shows the normalizing of inner_date
Fig. 5 shows the Pearson correlation coefficient analysis of users’ basic data set
Fig. 6 shows the Spearman's rank correlation coefficient analysis of users’ basic data set
Fig. 7 shows the relationship between user churn rate and user age
Fig. 8 shows the description of is_active_user
Fig. 9 shows the description of send_amount
Fig. 10 shows the description of store_balance
Fig. 11 shows the data segmentation
Fig. 12 shows the Pearson correlation coefficient analysis between certain selected fields (1) (suspended users)
Fig. 13 shows the Pearson correlation coefficient analysis between certain selected fields (2) (suspended users)
Fig. 14 shows the Pearson correlation coefficient analysis between certain selected fields (3) (suspended users)
Fig. 15 shows the Pearson correlation coefficient between different industries (suspended users)
Fig. 16 shows the Pearson correlation coefficient between different regions (suspended users)
Fig. 17 shows the Pearson correlation coefficient analysis about star level and type of network (suspended users)
Fig. 18 shows the Pearson correlation coefficient between different development channels (suspended users)
Fig. 19 shows the Pearson correlation coefficient between various Web Service Bundle (suspended users)
Fig. 20 shows the Pearson correlation coefficient analysis between certain selected fields (1) (cancelled users)
Fig. 21 shows the Pearson correlation coefficient analysis between certain selected fields (2) (cancelled users)
Fig. 22 shows the Pearson correlation coefficient analysis between certain selected fields (3) (cancelled users)
Fig. 23 shows the Pearson correlation coefficient between different industries (cancelled users)
Fig. 24 shows the Pearson correlation coefficient between different regions (cancelled users)
Fig. 25 shows the Pearson correlation coefficient analysis about star level and type of network (cancelled users)
Fig. 26 shows the Pearson correlation coefficient between different development channels (cancelled users)
Fig. 27 shows the Pearson correlation coefficient between various Web Service Bundle (cancelled users)
DESCRIPTION OF PREFERRED EMBODIMENT
Data Understanding
Six data sets were obtained from China Telecom, containing user information for the last three months:
1) Users’ basic data, abbreviated as user_base_data_a;
2) Users’ DPI data, abbreviated as user_dpi_data_a;
3) Users’ service bundle information, abbreviated as user_mix_data_a;
4) Users’ online behavior in the last three months, abbreviated as user_net_flux_data_a;
5) Users’ Broadband Networks data, abbreviated as user_net_data_a;
6) Suspension and cancellation of mobile service user list, abbreviated as user_state_data_a.
For each data set, we first describe each field for a brief understanding, using Python (run in PyCharm) as the analysis tool. Counting the number of variables, observations and missing values indicates which data processing method to use. The detailed information for each field shows that many missing values need to be filled in. Re-coding and normalization are used widely across all of these data sets. We then use the Pearson correlation coefficient and Spearman's rank correlation coefficient to illustrate the relationship between each pair of fields. In the last part of this procedure, we combine each data set with the suspension and cancellation list of mobile service users to compare user churn rates and draw conclusions for user churn prediction.
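The counting step described above can be sketched in a few lines of Python. This is only an illustration: the function name `profile`, the list-of-dicts layout and the sample rows are assumptions made for the example, not part of the original program.

```python
from collections import Counter

def profile(rows):
    """Summarize a data set given as a list of dicts (one dict per user)."""
    fields = sorted({name for row in rows for name in row})
    missing = Counter()
    for row in rows:
        for name in fields:
            # Treat None and empty strings as missing values.
            if row.get(name) in (None, ""):
                missing[name] += 1
    n = len(rows)
    return {
        "observations": n,
        "variables": len(fields),
        "missing_rate": {name: missing[name] / n for name in fields},
    }

# Hypothetical sample rows, invented for illustration only.
sample = [
    {"USER_ID": 1, "AGE": 34, "SEX": "F"},
    {"USER_ID": 2, "AGE": None, "SEX": "M"},
    {"USER_ID": 3, "AGE": 51, "SEX": ""},
    {"USER_ID": 4, "AGE": 28, "SEX": "M"},
]
summary = profile(sample)
```

The per-field missing rate is the figure that later decides whether a field is filled in or eliminated.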
(1) Users’ Basic Data Set
The users’ basic data set includes each user’s general information and the basic information of the service bundle.
Table. 1 Description of the fields in users’ basic data set
Field Name            Field Meaning
USER_ID               User ID
LATN_ID               City
AGE                   User’s age
SEX                   User’s gender
STARLEVEL             User’s star-rating level
CUSTTYPE              Client type
CORP_NAME             City-level branch
SUBST_NAME            County-level branch
BRANCH_NAME           Service fulfillment business hall
DEVELOP_CHANNEL_NEW   Agency (Channel Development)
WLLX                  Housing type
HANGYTYPE             User’s career industry type
BASEOFFERRH           Service bundle type
PAYTYPE               Payment type
INNER_DATE            User’s account time
ACCESSTYPE            Network line access mode
Table. 1 lists all the field names in this set together with their practical meaning, and Fig. 2 shows the overview produced by running the program.
Taking the field inner_date as an example, Fig. 3 describes how long each existing user has been with the company’s service, based on the data collected for the last three months.
The data in this field is concentrated in a few categories, so attention must be paid to fitting problems. The field can be converted to an integer to improve processing speed. Its maximum value is 1406, which would be about 117 years if counted in months, so it can be considered an outlier.
Fig. 4 shows the normalization of the inner_date field. The single outlier, 1406, should be discarded.
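A minimal sketch of this outlier check follows. The cut-off `MAX_PLAUSIBLE_MONTHS` and the sample values other than 1406 are assumptions for illustration; the source only states that 1406 is treated as an outlier.

```python
# Hypothetical cut-off (50 years of service); the source does not give a threshold.
MAX_PLAUSIBLE_MONTHS = 600

def split_outliers(values, upper=MAX_PLAUSIBLE_MONTHS):
    """Separate plausible tenure values (in months) from implausible ones."""
    kept = [v for v in values if 0 <= v <= upper]
    dropped = [v for v in values if not 0 <= v <= upper]
    return kept, dropped

inner_date = [3, 12, 48, 1406, 27]   # 1406 is the maximum reported in the text
kept, dropped = split_outliers(inner_date)
years = 1406 / 12                    # roughly 117 years, clearly not a real tenure
```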
Fig. 5 and Fig. 6 show the Pearson correlation coefficient analysis and the Spearman's rank correlation coefficient analysis among the fields of the first data set. From these two coefficients we can draw the following conclusions:
1) CORPID and LATN_ID are strongly correlated, which means that “City-level branch” and “City” are strongly correlated. Because “City-level branch” has more missing data, we eliminate this field;
2) AGE and INNER_DATE are correlated: the older the user, the longer the service time;
3) DEVELOP_CHANNEL_NEW and AGE are negatively correlated;
4) DEVELOP_CHANNEL_NEW and INNER_DATE are strongly negatively correlated.
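For reference, the two coefficients used above can be computed as in the following sketch. It is a plain-Python illustration rather than the program behind Fig. 5 and Fig. 6, and the `age` and `inner_date` values are invented. Spearman's coefficient is obtained here as the Pearson coefficient of the ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical values: service time grows with age, but not linearly.
age = [22, 35, 47, 58, 63]
inner_date = [4, 15, 40, 90, 200]
r_pearson = pearson(age, inner_date)
r_spearman = spearman(age, inner_date)
```

On monotonic but non-linear data such as this, the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1, which is why it helps to look at both.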
Fig. 7 shows the relationship between the user churn rate and user age. We can conclude that the younger the user, the higher the rate of terminating the service. Users aged 40-60 are more stable than those under 40, which makes them the main group for the company to reach out to.
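A churn rate per age band, of the kind plotted in Fig. 7, could be computed along the following lines. The ten-year band width, the function name and the sample users are assumptions made for this sketch.

```python
from collections import defaultdict

def churn_rate_by_age_band(users, band_width=10):
    """users: iterable of (age, is_lost) pairs. Returns {band_start: churn_rate}."""
    totals = defaultdict(int)
    lost = defaultdict(int)
    for age, is_lost in users:
        band = (age // band_width) * band_width
        totals[band] += 1
        lost[band] += int(is_lost)
    return {band: lost[band] / totals[band] for band in sorted(totals)}

# Hypothetical users, invented for illustration only.
users = [(23, True), (27, True), (29, False), (34, True), (38, False),
         (45, False), (47, False), (52, True), (55, False), (58, False)]
rates = churn_rate_by_age_band(users)
```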
The remaining five data sets are described in the same way.
Table. 2 to Table. 6 describe every field of these five data sets:
(2) Users’ Service Bundle Information Set
Table. 2 Description of the fields in users’ service bundle information set
Field Name        Field Meaning
USER_ID           User ID
KD_MIX_TYPE       Broadband service bundle type
OFFERNAME         Service bundle name
XYQLIMITDATE      Expiration month of yearly package
YCOFFERNAME       Deposit package name
YCLIMITDATE       Expiration date of the deposit package
LIMITDATENEW      Discount expiration date
YW_SPEED_VALUE    Package network speed
TY_SPEED_VALUE    Experienced network speed
ZD_SPEED_VALUE    Maximum accessible broadband network speed
UPSPEEDMONTH      Month of the last internet speed upgrade
ISTZG             Whether the user is eligible for replacing copper wire with optical fiber
ISMOREKD          Whether the user is eligible for an internet speed upgrade
ISATTENSPEED      Whether the user’s main concern is network speed
UPSPEEDHANGY      Whether the user is in an industry with potential demand for an internet speed upgrade
ISVISTOTHERURL    Whether the user ever visited a competitor's web page
ISPHSTK           Whether the user opened this account with a deposit
IS_ACTIVE_USER    Active user or not
STOREBALANCE      Deposit principal amount
SEND_AMOUNT       Upload data consumption
(3) Users’ Broadband Networks Data Set
Table. 3 Description of the fields in users’ broadband networks data set
Field Name          Field Meaning
USER_ID             User ID
NEW_MIX_TYPE        Service bundle type
CUST_MOBILE_NUMS    Number of mobile phones owned by the user
CUST_GH_NUMS        Number of fixed phones owned by the user
CUST_ITV_NUMS       Number of iTV owned by the user
ZK-ARPU             Mobile revenue
BRANDTYPE           Mobile telecom provider
BRAND               Mobile phone brand name
ZKOFFERNAME         Mobile data service bundle type
AMOUNT_09           Mobile phone data
(4) Users’ Online Behavior in the Last Three Months
Table. 4 Description of the fields in the set of users’ online behavior in the last three months
Field Name            Field Meaning
USER_ID               User ID
CUSTARPU              General income in the last month
ACCOUNTARPU           Broadband revenues in the last month
NET_INNER_DUR         Total online time in the last month
NET_TIMES             Gsesp in the last month
NET_FLUX              Online data consumption in the last month
RECV_FLUX             Downstream in the last month
SEND_FLUX             Uploading in the last month
CUST_ARPU_PRE2        General income in the month before last
ACCOUNTARPUPRE2       Broadband revenues in the month before last
NET_INNER_DUR_PRE2    Total online time in the month before last
NET_TIMES_PRE2        Gsesp in the month before last
NET_FLUX_PRE2         Online data consumption in the month before last
RECV_FLUX_PRE2        Downstream in the month before last
SEND_FLUX_PRE2        Uploading in the month before last
CUST_ARPU_PRE3        General income two months ago
ACCOUNTARPUPRE3       Broadband revenues two months ago
NET_INNER_DUR_PRE3    Total online time two months ago
NET_TIMES_PRE3        Gsesp two months ago
NET_FLUX_PRE3         Online data consumption two months ago
RECV_FLUX_PRE3        Downstream two months ago
SEND_FLUX_PRE3        Uploading two months ago
(5) Users’ DPI Data Set
Table. 5 Description of the fields in users’ DPI data set
Field Name                   Field Meaning
user_id                      User ID
keywords                     Online searching keyword
keywords_search_cnt_work     Number of searching times on weekdays
keywords_visit_date_work     Number of online days on weekdays
keywords_search_cnt_holid    Number of searching times on holidays
keywords_visit_date_holid    Number of online days on holidays
is_holiday                   Whether the keyword is the holiday
(6) Suspension and Cancellation of Mobile Service User List
Table. 6 Description of the fields in the set of suspension and cancellation of mobile service user list
Field Name    Field Meaning
user_id       User ID
State         Suspension and cancellation status
Data Cleaning
Once all the required data has been collected and described, the next step is analysis and selection. This step plays a crucial role in the whole project, since it determines the accuracy of the final results.
The collected data contains a number of disturbing factors; it can only be analyzed after cleaning, that is, after eliminating outliers and unreasonable values.
The exact processing procedure and methods are as follows:
(1) Missing value:
Solution:
1) Missing value treatment: if only a small share of the values is missing, a random forest model is used to fill in the missing values.
2) Elimination treatment: if more than 50% of the values are missing, the field is discarded.
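The two rules above can be sketched as follows. Note one deliberate substitution: to keep the example dependency-free, the fill step uses the field median instead of a random forest model, so this shows only the decision logic (fill or discard), not the fill method of the invention. The function name and sample columns are hypothetical.

```python
import statistics

MISSING = (None, "")

def clean_missing(columns, drop_threshold=0.5):
    """columns: dict mapping field name -> list of numeric values (None or "" if missing)."""
    cleaned = {}
    for name, values in columns.items():
        present = [v for v in values if v not in MISSING]
        missing_rate = 1 - len(present) / len(values)
        if missing_rate > drop_threshold:
            continue  # elimination treatment: more than 50% missing
        # Stand-in for the random forest fill: use the median of the known values.
        fill = statistics.median(present) if present else None
        cleaned[name] = [fill if v in MISSING else v for v in values]
    return cleaned

# Hypothetical columns, invented for illustration only.
columns = {
    "age": [30, None, 50, 40],        # 25% missing: filled
    "corpid": [None, None, None, 7],  # 75% missing: discarded
}
cleaned = clean_missing(columns)
```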
Note: missing values, outliers and unknown data should all be classified as missing values.
(2) Processing of continuous value:
For readability, we adopt the Min-Max Normalization method, which maps the resulting values to [0, 1]. The conversion function is as follows:
$$x' = \frac{x - \min}{\max - \min}$$
Here max is the maximum value of the sample data and min is the minimum value of the sample data. The advantage of this method is that it is tolerant and easy to convert; the drawback is that adding new data may change max and min, which then need to be redefined.
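As a small illustration of the conversion function, the sketch below applies Min-Max normalization to a list of values. The handling of a constant field (all values mapped to 0) is an assumption added to avoid division by zero, and the sample values are invented.

```python
def min_max_normalize(values):
    """Map values to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant field carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

store_balance = [0, 100, 250, 1000]   # hypothetical values
scaled = min_max_normalize(store_balance)
```

If a later record exceeded 1000, max would change and every previously scaled value would have to be recomputed, which is the drawback noted above.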
(3) Conversion processing:
For example, Fig. 8 demonstrates converting is_active_user into an integer or Boolean type to improve processing speed.
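A conversion of this kind might look like the sketch below. How is_active_user is actually encoded in the raw data is not stated in the source, so the accepted tokens in `TRUE_TOKENS` are purely hypothetical.

```python
# Hypothetical encodings of "active"; the real raw values are not given in the source.
TRUE_TOKENS = {"1", "y", "yes", "true", "t"}

def to_flag(value):
    """Return 1 or 0 for an activity flag, or None when the value is missing."""
    if value is None or str(value).strip() == "":
        return None
    return int(str(value).strip().lower() in TRUE_TOKENS)

raw = ["1", "0", "Y", "", "no", 1]
flags = [to_flag(v) for v in raw]
```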
(4) Processing of normalization and discretization:
Fig. 9 shows that the maximum value of the field send_amount is 113590000, so the data in this field should be normalized.
Fig. 10 shows that the maximum value of store_balance is 44974, so this field needs outlier processing, after which we try normalization.
(5) Extra examples of data processing:
Table. 7 List of data processing
Processing content                            English field name    Missing rate    Data set
Type conversion and missing value             user_id               0.0%            USER_BASE_DATA_A
                                              Offer_id              1.7%            USER_NET_DATA_A
                                              Yw_speed_value        1.7%            USER_NET_DATA_A
Take the three-month mean and normalize it    Cust_arpu                             USER_NET_FLUX_DATA_A
                                              Account_arpu                          USER_NET_FLUX_DATA_A
                                              Net_inner_dur                         USER_NET_FLUX_DATA_A
delete                                        New_mix_type          49.4%           USER_MIX_DATA_A
                                              Send_flux_Pre2                        USER_NET_FLUX_DATA_A
                                              Cust_arpu_pre3                        USER_NET_FLUX_DATA_A
eliminate                                     Is_phs_tk             0.0%            USER_NET_DATA_A
                                              Is_holiday                            USER_DPI_DATA_A
                                              Corpid                12.5%           USER_BASE_DATA_A
normalization                                 Store_balance         0.0%            USER_NET_DATA_A
                                              Send_amount           1.7%            USER_NET_DATA_A
code                                          Access_type           0.0%            USER_BASE_DATA_A
                                              wllx                  0.0%            USER_BASE_DATA_A
Data coding:
Discrete data is one-hot encoded, field by field, using a specific lookup table.
Data segmentation:
Because of the large amount of data, the data is sliced by variable in practice: each variable generates its own indexed file, and the files are processed separately and then merged, somewhat like MapReduce, as shown in Fig. 11.
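The data coding step can be illustrated with a short one-hot encoding sketch. The function name, the way unseen values are handled and the sample categories are assumptions for this example; the lookup table of the invention is not reproduced here.

```python
def one_hot(values, categories=None):
    """One-hot encode a discrete field; returns (categories, encoded_rows)."""
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}   # plays the role of the lookup table
    rows = []
    for v in values:
        row = [0] * len(categories)
        if v in index:
            row[index[v]] = 1      # values absent from the table stay all-zero
        rows.append(row)
    return categories, rows

# Hypothetical access types, invented for illustration only.
cats, encoded = one_hot(["fiber", "copper", "fiber", "lan"])
```

Passing a fixed `categories` list keeps the column order stable when each sliced file is encoded separately and merged afterwards.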
Data consolidation
The cleaned data is consolidated into three files: train_1.txt (suspension list), train_2.txt (cancellation list) and train_ls.txt (loss list, i.e. users who meet either the suspension or the cancellation condition).
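The relationship between the three lists can be expressed as a set union, as in the sketch below. The status strings and user IDs are hypothetical; the source does not say how the State field is coded.

```python
def build_label_lists(states):
    """states: dict user_id -> status string.

    The status strings "suspended", "cancelled" and "active" are hypothetical.
    """
    suspended = {u for u, s in states.items() if s == "suspended"}
    cancelled = {u for u, s in states.items() if s == "cancelled"}
    lost = suspended | cancelled   # a lost user meets either condition
    return suspended, cancelled, lost

states = {101: "active", 102: "suspended", 103: "cancelled", 104: "active"}
suspended, cancelled, lost = build_label_lists(states)
```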
Feature engineering
Analysis of the characteristics of service suspending users
Fig. 12 shows that the correlation between these variables is quite weak. Compared with the rest of the data, the correlation between inner_date and age_MinMax is relatively strong.
As in the previous chart, the correlations shown in Fig. 13 are not strong either.
Nevertheless, the relationship between income and the number of mobile phones a customer owns is extremely strong.
In Fig. 14, the correlation between ‘searching ratio on weekdays’ and ‘average number of searches per day’ is strong.
In addition, ‘total number of searches’ and ‘average number of searches per day’ show a moderate correlation.
Fig. 15 shows the relationships between different industries.
Among these industries, ‘industry 1’ and ‘industry 0’ show a strong negative correlation. The correlation between the remaining industries is so weak that it can be ignored.
Fig. 16 shows the Pearson correlation between different regions.
The correlations between regions are all negative, with values so small that they can be ignored, except for a weak correlation between ‘region 17’ and ‘region 6’.
Fig. 17 covers star level and type of network: ‘type 5’ and ‘star_level 7’ show a weak correlation, while ‘type 0’ and ‘type 1’ show a negative correlation.
Fig. 18 shows the relationship between different development channels.
The correlation between most development channels can be ignored; however, the correlation between development channel 0 and development channels 11/9/8/6/5/3/1 is relatively strong.
Fig. 19 shows the correlations between the various Web Service Bundles.
There is a weak correlation between ‘type 2’ and ‘type 0’ and a strong negative correlation between ‘type 1’ and ‘type 0’.
Overall, the features of service suspending customers have some internal relationships, even though service suspension is not highly correlated with any single variable. Among service suspending customers, relationships exist between the following pairs: age / inner_date, searching ratio on weekdays / average searches per day, number of searches / average searches per day, star level 7 / type of network 5, net_flux_avg_MinMax / net_inner_dur_avg_MinMax, net_flux_avg_MinMax / s_cnt_MinMax.
The Pearson correlation coefficients are negative and show a strong relationship for the following pairs: customer type / is_visit_other_url, pay_type / store_balance_MinMax, brand_type0 / brand_type1, brand_type2 / brand_type0, hangy_type0 / hangy_type1, latn_id6 / latn_id17, wllx0 / wllx1, wllx2 / wllx0, develop_channel_new0 / develop_channel_new1,3,5,6,8,9,11, kd_mix_type0 / kd_mix_type1, kd_mix_type0 / kd_mix_type2.
Service cancelling characteristics analysis
In general, the features of service cancelling users are similar to those of service suspending users, with only slight disparities.
Characteristics of lost users
(1) Correlation of features
The feature correlations of lost users are the same as those of service suspending users and service cancelling users.
(2) Selection of characteristics
We use random forest, AdaBoost, ExtraTree, GradientBoosting and DecisionTree to obtain the best parameters of the characteristics and a precise prediction.
We normalize the scores given by each classifier, take their average, and identify the best features: net_inner_dur, account_arpu, cust_arpu, send_amount, inner_date, net_times, net_flux, the amount deposited, and age.
Table. 8 Parameters of the selected characteristics
feature                     RF_S_MM   ADA_S_MM   ET_S_MM   GB_S_MM   DT_S_MM   MEAN
net_inner_dur_avg_MinMax    0.9350    0.8421     1.0000    0.7525    0.7692    0.8598
account_arpu_avg_MinMax     0.8769    0.0000     0.7779    1.0000    1.0000    0.7310
cust_arpu_avg_MinMax        1.0000    0.0000     0.7601    0.9220    0.8030    0.6970
send_amount_MinMax          0.8451    0.6842     0.4953    0.5298    0.7194    0.6547
inner_date_MinMax           0.8980    0.0526     0.8069    0.6829    0.7347    0.6350
net_times_avg_MinMax        0.8110    0.0526     0.7170    0.7407    0.7392    0.6121
net_flux_avg_MinMax         0.8824    0.0000     0.7444    0.6364    0.5654    0.5657
store_balance_MinMax        0.3732    1.0000     0.3922    0.3386    0.4099    0.5028
age_MinMax                  0.7948    0.0000     0.7197    0.5076    0.4730    0.4990
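The MEAN column of Table. 8 is the plain average of the five classifier scores in each row, which can be checked with a short sketch. Only three rows are copied here, with feature names written with underscores; the dictionary name `table8` and the helper `mean_score` are introduced for the example.

```python
def mean_score(scores):
    """Average the normalized scores that the classifiers assign to one feature."""
    return sum(scores) / len(scores)

# Scores copied from three rows of Table. 8, in the order RF, ADA, ET, GB, DT.
table8 = {
    "net_inner_dur_avg_MinMax": [0.9350, 0.8421, 1.0000, 0.7525, 0.7692],
    "account_arpu_avg_MinMax": [0.8769, 0.0000, 0.7779, 1.0000, 1.0000],
    "age_MinMax": [0.7948, 0.0000, 0.7197, 0.5076, 0.4730],
}
means = {name: round(mean_score(scores), 4) for name, scores in table8.items()}
ranking = sorted(table8, key=lambda name: mean_score(table8[name]), reverse=True)
```

Averaging across classifiers dampens the effect of a single model that scores a feature at 0, as AdaBoost does for several rows.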
Data Modeling
Currently, we use the GaussianNB, LogisticRegression, DecisionTreeClassifier, ExtraTreesClassifier, AdaBoostClassifier, MLPClassifier and RandomForestClassifier classifiers. The next step is to determine which classifier is most suitable for this invention. This invention mainly uses a Bagging combination algorithm to find the best classification method.
We analyze the ROC curve of each classifier and of the Bagging combination for the State field and find that the Extra Trees Classifier has the worst performance, while the combined algorithm reaches AUC=0.78, which still needs to be optimized.
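To make the AUC comparison concrete, the sketch below computes AUC from labels and scores and evaluates a simple combination of two classifiers. The combination shown is plain score averaging, used here as a stand-in because the source does not spell out the exact form of its Bagging combination; the labels and scores are invented.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive is scored above a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def combine(score_lists):
    """Average the per-user scores of several classifiers."""
    return [sum(col) / len(col) for col in zip(*score_lists)]

# Hypothetical labels (1 = lost user) and classifier scores.
labels = [1, 0, 1, 0, 1, 0]
clf_a = [0.9, 0.4, 0.3, 0.2, 0.8, 0.6]
clf_b = [0.6, 0.5, 0.7, 0.1, 0.4, 0.3]
combined = combine([clf_a, clf_b])
auc_a = roc_auc(labels, clf_a)
auc_b = roc_auc(labels, clf_b)
auc_combined = roc_auc(labels, combined)
```

In this toy case the two classifiers make different mistakes, so the averaged scores rank the users better than either classifier alone; a weak member can equally drag the combination down, which is the reason for dropping the worst classifier.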
Model Optimizing
The model built above reaches AUC=0.78 and still needs to be optimized. The main optimization ideas are as follows:
1) In terms of features, many features are not normally distributed, and we plan to apply a logarithmic transform to them.
2) In terms of model processing, the model is still under-fitted at present, and we hope to add cross validation to find the best parameters.
3) In terms of the combination, the Extra Trees Classifier, which performs worst, is eliminated.
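The first two ideas can be sketched as follows. Using log(1 + x) so that zero values stay defined, and splitting into contiguous folds, are choices made for this illustration; the source only says that a logarithm and cross validation are planned.

```python
import math

def log_transform(values):
    """Compress a long-tailed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# 113590000 is the send_amount maximum quoted in the text; the other values are invented.
send_amount = [0, 9, 99, 113590000]
logged = log_transform(send_amount)
folds = list(k_fold_indices(7, 3))
```

Each candidate parameter setting would be trained on the train indices and scored on the test indices of every fold, and the setting with the best average score kept.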
Optimization of lost users
After taking the logarithm of the original data, we train the model. The first training of the combined model gave AUC=0.79, and we found that the naive Bayes and logistic classifiers were not very effective. In the second training these two classifiers were eliminated, and the final combined model reached AUC=0.823.
Optimization for service suspending users
As before, we take the logarithm of the original data and then train the model. The first training of the combined model gave AUC=0.832, and we found that the naive Bayes and logistic classifiers were not effective. The second training used the combination of Decision Tree, AdaBoost and Random Forest, which had the highest scores in the first training, and the final AUC was 0.863.
Optimization of service cancelling users
As before, we take the logarithm of the original data and then train the model. The first training of the combined model gave AUC=0.739, and we found that the naive Bayes, logistic and MLP neural network classifiers performed relatively poorly. The second training used a combination of DecisionTree and AdaBoost, giving a combined model AUC of 0.783, only slightly better than the AdaBoost classifier alone (AUC=0.780), so we finally decided to use the AdaBoost classifier.
The Future Plan
The results basically meet the needs. We hope to optimize the model further in the future.
For some features, we plan to work with the business side and study whether other feature indicators can be derived, such as binding expiration time, recent revenue trend, etc.
In terms of the combination, the Extra Trees Classifier, which currently performs worst, is eliminated, and in addition Stacking integration will be attempted.
In terms of model processing, the model appears to be under-fitted at present, and we plan to add cross validation to find the best parameters.
Next, we plan to export the model and make it into a small tool that can be embedded into the existing business app or web page to provide an auxiliary reference for business promotion in the territory.

Claims (4)

1. A statistical analysis method of mobile telecom data driven user loss prediction, wherein millions of records of users’ information and behavior in the last three months are obtained from China Telecom to analyze the relationships between each characteristic and the loss of users; the resulting model can also be used for future user loss prediction.
2. The method according to claim 1, wherein many data collection and processing tools are used, including filling in and eliminating missing data, data normalization, conversion processing, data re-coding, data segmentation, data consolidation, etc.
3. The method according to claim 1, wherein a bagging combination algorithm is mainly used to find the best classification method.
4. The method according to claim 1, wherein an analysis method and a strong theoretical and policy basis are provided for China Telecom to update big data in the broadband market competition.
AU2019101198A 2019-10-03 2019-10-03 A statistical analysis method of mobile telecom data driven user loss prediction Ceased AU2019101198A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101198A AU2019101198A4 (en) 2019-10-03 2019-10-03 A statistical analysis method of mobile telecom data driven user loss prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019101198A AU2019101198A4 (en) 2019-10-03 2019-10-03 A statistical analysis method of mobile telecom data driven user loss prediction

Publications (1)

Publication Number Publication Date
AU2019101198A4 (en) 2020-01-16

Family

ID=69146757

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101198A Ceased AU2019101198A4 (en) 2019-10-03 2019-10-03 A statistical analysis method of mobile telecom data driven user loss prediction

Country Status (1)

Country Link
AU (1) AU2019101198A4 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309718A (en) * 2020-02-19 2020-06-19 南方电网科学研究院有限责任公司 Distribution network voltage data missing filling method and device
CN111309718B (en) * 2020-02-19 2023-05-23 南方电网科学研究院有限责任公司 Distribution network voltage data missing filling method and device
CN112686718A (en) * 2021-03-19 2021-04-20 深圳索信达数据技术有限公司 Method and device for acquiring user loss reason, computer equipment and storage medium
CN112686718B (en) * 2021-03-19 2021-06-29 深圳索信达数据技术有限公司 Method and device for acquiring user loss reason, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107562818B (en) Information recommendation system and method
US20090028183A1 (en) Platform for communicating across multiple communication channels
CN107688967A (en) The Forecasting Methodology and terminal device of client's purchase intention
AU2019101198A4 (en) A statistical analysis method of mobile telecom data driven user loss prediction
CN113051291A (en) Work order information processing method, device, equipment and storage medium
CN114547475B (en) Resource recommendation method, device and system
CN108564255A (en) Matching Model construction method, orphan's list distribution method, device, medium and terminal
Napitu et al. Twitter opinion mining predicts broadband internet's customer churn rate
CN113111250A (en) Service recommendation method and device, related equipment and storage medium
CN107977855B (en) Method and device for managing user information
CN112036631B (en) Purchasing quantity determining method, purchasing quantity determining device, purchasing quantity determining equipment and storage medium
CN113989020A (en) Loan overdue information processing method and device, computer equipment and storage medium
CN111104603A (en) Real-time hybrid recommendation method and system based on Lambda architecture
JP2002297875A (en) Customer relation management method, system and program
US20210357953A1 (en) Availability ranking system and method
CN115545886A (en) Overdue risk identification method, overdue risk identification device, overdue risk identification equipment and storage medium
AU2014204120A1 (en) Priority-weighted quota cell selection to match a panelist to a market research project
Martín et al. A numerical analysis of allocation strategies for the multi-armed bandit problem under delayed rewards conditions in digital campaign management
CN112288402A (en) Data processing method, device, equipment and storage medium
CN115880077A (en) Recommendation method and device based on client label, electronic device and storage medium
CN113641654B (en) Marketing treatment rule engine method based on real-time event
CN117725313B (en) Intelligent identification and recommendation system
JP7368897B1 (en) information processing equipment
Yang et al. Analysis on marketing ability and financial performance in internet company
Petrovic Adopting Data Mining Techniques in Telecommunications Industry: Call Center Case Study

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry