AU2019101198A4 - A statistical analysis method of mobile telecom data driven user loss prediction - Google Patents
- Publication number
- AU2019101198A4
- Authority
- AU
- Australia
- Prior art keywords
- data
- users
- user
- shows
- telecom
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0202—Market predictions or forecasting for commercial activities
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Human Resources & Organizations (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Strategic Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Economics (AREA)
- General Engineering & Computer Science (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Operations Research (AREA)
- Databases & Information Systems (AREA)
- Development Economics (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Computational Biology (AREA)
- Algebra (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Educational Administration (AREA)
- Medical Informatics (AREA)
- Game Theory and Decision Science (AREA)
- Computing Systems (AREA)
- Marketing (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This invention applies to the internet and mobile telecom industry. It extracts data from China Telecom covering users' basic information, mobile data service bundle information, broadband network service information, online behavior in the last three months, DPI search keywords, and users' suspension and cancellation status, and uses big data analysis to study the relationship between this information and broadband customer loss. We also summarize this experience into a feature model that gives the operator a strong theoretical and policy basis for updating its data and for applying the model to future user loss prediction. Fig. 2
Description
TITLE
A statistical analysis method of mobile telecom data driven user loss prediction
FIELD OF THE INVENTION
This invention is in the field of the internet and telecom industry and serves as a method for analyzing internet and mobile data users and predicting telecom broadband user loss.
BACKGROUND
The competition for market share in the Internet telecom industry is becoming increasingly fierce, globally and especially in China. In China, there are three major mobile operators: China Mobile, China Telecom, and China Unicom. In this competition, China Telecom focuses on broadband network share, a segment that went through the internet boom in China a few years ago. Now that market growth has stabilized, customer loss is becoming an issue for all three operators.
Under such circumstances, there is a pressing need to summarize common problems and modeling experience so that internet telecom operators have a basis for predicting high-risk broadband users. This is the Age of Big Data, but the database analysis methods used by internet and telecom operators are still relatively backward and cannot keep up with the rapidly growing demand for processing tremendous amounts of data.
2019101198 03 Oct 2019
Therefore, this invention extracts the data characteristics of telecom broadband user loss from China Telecom. This invention will provide an analysis method and a powerful theoretical and policy basis for China Telecom to update big data in the broadband market competition.
SUMMARY
This invention uses China Telecom user data from the last three months as the object of study, as shown in Fig. 1. The data is divided into six groups; the sixth group holds the users’ suspension and cancellation information and is compared with each of the other five data sets separately to analyze the user churn rate. We then describe each field to better understand the data’s characteristics, the relationships between fields, and their bearing on the user churn rate analysis.
Based on the overview of all data obtained in the first step, we process the data and treat missing values and outliers so that it is easier to read and load in Python. This invention refers to this procedure as Data Cleaning, which includes missing data completion and elimination, data normalization, conversion processing, data re-coding, data segmentation, and data consolidation.
Then, Pearson correlation coefficient analysis is applied separately to suspended users, cancelled users, and lost users to confirm the most influential characteristics.
Finally, this invention uses a Bagging combination algorithm to find the best combination of classification methods among GaussianNB, LogisticRegression, DecisionTreeClassifier, ExtraTreesClassifier, AdaBoostClassifier, MLPClassifier, and RandomForestClassifier, and then optimizes the resulting model.
DESCRIPTION OF DRAWING
Fig. 1 is the invention flow chart
Fig. 2 shows the overview of users’ basic data set
Fig. 3 shows the description of inner_date
Fig. 4 shows the normalizing of inner_date
Fig. 5 shows the Pearson correlation coefficient analysis of users’ basic data set
Fig. 6 shows the Spearman's rank correlation coefficient analysis of users’ basic data set
Fig. 7 shows the relationship between user churn rate and users’ age
Fig. 8 shows the description of is_active_user
Fig. 9 shows the description of send_amount
Fig. 10 shows the description of store_balance
Fig. 11 shows the data segmentation
Fig. 12 shows the Pearson correlation coefficient analysis between certain selected fields (1) (suspended users)
Fig. 13 shows the Pearson correlation coefficient analysis between certain selected fields (2) (suspended users)
Fig. 14 shows the Pearson correlation coefficient analysis between certain selected fields (3) (suspended users)
Fig. 15 shows the Pearson correlation coefficient between different industries (suspended users)
Fig. 16 shows the Pearson correlation coefficient between different regions (suspended users)
Fig. 17 shows the Pearson correlation coefficient analysis about star level and type of network (suspended users)
Fig. 18 shows the Pearson correlation coefficient between different development channels (suspended users)
Fig. 19 shows the Pearson correlation coefficient between various Web Service Bundle (suspended users)
Fig. 20 shows the Pearson correlation coefficient analysis between certain selected fields (1) (cancelled users)
Fig. 21 shows the Pearson correlation coefficient analysis between certain selected fields (2) (cancelled users)
Fig. 22 shows the Pearson correlation coefficient analysis between certain selected fields (3) (cancelled users)
Fig. 23 shows the Pearson correlation coefficient between different industries (cancelled users)
Fig. 24 shows the Pearson correlation coefficient between different regions (cancelled users)
Fig. 25 shows the Pearson correlation coefficient analysis about star level and type of network (cancelled users)
Fig. 26 shows the Pearson correlation coefficient between different development channels (cancelled users)
Fig. 27 shows the Pearson correlation coefficient between various Web Service Bundle (cancelled users)
DESCRIPTION OF PREFERRED EMBODIMENT
Data Understanding
Six data sets were obtained from China Telecom, containing user information for the last three months:
1) Users’ basic data, abbreviated as user_base_data_a;
2) Users’ DPI data, abbreviated as user_dpi_data_a;
3) Users’ service bundle information, abbreviated as user_mix_data_a;
4) Users’ online behavior in the last three months, abbreviated as user_net_flux_data_a;
5) Users’ Broadband Networks data, abbreviated as user_net_data_a;
6) Suspension and cancellation of mobile service user list, abbreviated as user_state_data_a.
For each data set, we first explain each field briefly and then use Python (run in PyCharm) as the analysis tool to describe each field in the set. Counting the number of variables, observations, and missing values indicates which data processing method to use. The detailed information for the individual fields shows that many missing values need to be filled in. Re-coding and normalization are widely used across all of these data sets. Then, we use the Pearson correlation coefficient and Spearman's rank correlation coefficient to illustrate the relationship between each pair of fields. In the last part of this procedure, we combine each data set with the suspension and cancellation list of mobile service users to compare user churn rates and draw conclusions for user churn prediction.
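The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the set of tokens treated as missing, and the helper name profile_fields are hypothetical and are not taken from the China Telecom data.

```python
import csv
import io

# Hypothetical sample rows; the real user_base_data_a contents are not given in the text.
SAMPLE = """USER_ID,AGE,SEX,INNER_DATE
u1,34,M,12
u2,,F,40
u3,51,,1406
u4,27,M,
"""

# Tokens treated as missing; this list is an assumption made for illustration.
MISSING_TOKENS = {"", "NA", "null", "unknown"}


def profile_fields(rows):
    """Return {field: (missing_count, missing_ratio)} for a list of dict rows."""
    n = len(rows)
    report = {}
    for field in rows[0].keys():
        missing = sum(1 for r in rows if (r[field] or "").strip() in MISSING_TOKENS)
        report[field] = (missing, missing / n)
    return report


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = profile_fields(rows)
print(f"variables={len(report)} observations={len(rows)}")
for field, (count, ratio) in report.items():
    print(f"{field:<12} missing={count} ({ratio:.1%})")
```

A per-field missing ratio of this kind is what later decides whether a field is filled in or discarded.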
(1) Users’ Basic Data Set
The users’ basic data set contains the users’ general information and the basic information of the service bundle.
Table. 1 Description of the fields in users’ basic data set
Field Name | Field Meaning |
USER_ID | User ID |
LATN_ID | City |
AGE | User’s age |
SEX | User’s gender |
STARLEVEL | User’s star-rating level |
CUSTTYPE | Client type |
CORP_NAME | City-level branch |
SUBST_NAME | County-level branch |
BRANCH_NAME | Service fulfillment business hall |
DEVELOP_CHANNEL_NEW | Agency (Channel Development) |
WLLX | Housing type |
HANGYTYPE | User’s career industry type |
BASEOFFERRH | Service bundle type |
PAYTYPE | Payment type |
INNER_DATE | User’s account time |
ACCESSTYPE | Network line access mode |
Table. 1 explains the practical meaning of every field name in this set.
Fig. 2 shows an overview of the output of the program run.
Taking the field inner_date as an example, Fig. 3 describes how long each existing user has been with the company’s service, based on the data collected over the last three months.
The data of this field is concentrated in certain categories, so it is necessary to pay attention to the fitting problem. This field can be converted to an integer type to improve processing speed. The maximum value of this field is 1406, which would be about 117 years if counted in months, so it can be considered an outlier.
Fig. 4 shows the normalization of the inner_date field. The single outlier, 1406, should be discarded.
Fig. 5 and Fig. 6 show the Pearson correlation coefficient analysis and the Spearman's rank correlation coefficient analysis among the fields in the first data set. From these two coefficients, we can draw the following conclusions:
1) CORPID and LATN_ID are strongly correlated, which means that “City-level branch” and “City” are strongly correlated. Because “City-level branch” has more missing data, we eliminate this field;
2) AGE and INNER_DATE are correlated: the older the user, the longer the service time;
3) DEVELOP_CHANNEL_NEW and AGE are negatively correlated;
4) DEVELOP_CHANNEL_NEW and INNER_DATE are strongly negatively correlated.
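For reference, the two coefficients used in this analysis can be computed as follows. This is a minimal pure-Python sketch, not the program used for Fig. 5 and Fig. 6; the age and inner_date values are hypothetical, and Spearman's coefficient is obtained here as the Pearson coefficient of the average ranks.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values for two fields of the basic data set.
age = [22, 35, 41, 58, 63]
inner_date = [3, 20, 18, 90, 400]

print("Pearson :", round(pearson(age, inner_date), 3))
print("Spearman:", round(spearman(age, inner_date), 3))
```

Because Spearman's coefficient only looks at ranks, a single extreme value such as the last inner_date entry affects it far less than it affects the Pearson coefficient.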
Fig. 7 shows the relationship between user churn rate and users’ age. We can conclude that the younger the user, the higher the rate of terminating the service. Users who are 40-60 years old are more stable than customers under 40, which makes them the main group for the company to reach out to.
The remaining five data sets are described using the same method.
Table. 2 to Table. 6 explain every field of the remaining five data sets:
(2) Users’ Service Bundle Information Set
Table. 2 Description of the fields in users’ service bundle information set
Field Name | Field Meaning |
USER_ID | User ID |
KD_MIX_TYPE | Broadband service bundle type |
OFFERNAME | Service bundle name |
XYQLIMITDATE | Expiration month of yearly package |
YCOFFERNAME | Deposit package name |
YCLIMITDATE | Expiration date of the deposit package |
LIMITDATENEW | Discount expiration date |
YW_SPEED_VALUE | Package network speed |
TY_SPEED_VALUE | Experienced network speed |
ZD_SPEED_VALUE | Maximum accessible broadband network speed |
UPSPEEDMONTH | Month for last internet speeding up |
ISTZG | Whether the user is eligible for replacing copper wire with optical fiber |
ISMOREKD | Whether the user is eligible for internet speeding up |
ISATTENSPEED | Whether the user’s main concern is network speed |
UPSPEEDHANGY | Whether the user is in the industry field with the potential of internet speeding up requirement |
ISVISTOTHERURL | Whether the user ever visited a competitor's web page |
ISPHSTK | Whether the user opened this account with deposit |
IS_ACTIVE_USER | Active user or not |
STOREBALANCE | Deposit principal amount |
SEND_AMOUNT | Upload data consumption |
(3) Users’ Broadband Networks Data Set
Table. 3 Description of the fields in users’ broadband networks data set
Field Name | Field Meaning |
USER_ID | User ID |
NEW_MIX_TYPE | Service bundle type |
CUST_MOBILE_NUMS | number of mobile phones owned by the user |
CUST_GH_NUMS | number of fixed phones owned by the user |
CUST_ITV_NUMS | number of iTV owned by the user |
ZK_ARPU | mobile revenue |
BRANDTYPE | Mobile telecom provider |
BRAND | Mobile phone brand name |
ZKOFFERNAME | Mobile data service bundle type |
AMOUNT_09 | Mobile phone data |
(4) Users’ Online Behavior in the Last Three Months
Table. 4 Description of the fields in the set of users’ online behavior in the last three months
Field Name | Field Meaning |
USER_ID | User ID |
CUSTARPU | General income in last month |
ACCOUNTARPU | Broadband revenues in last month |
NET_INNER_DUR | Total online time in last month |
NET_TIMES | Gsesp in last month |
NET_FLUX | Online data consumption in last month |
RECV_FLUX | Downstream in last month |
SEND_FLUX | Uploading in last month |
CUST_ARPU_PRE2 | General income in the month before last |
ACCOUNTARPUPRE2 | Broadband revenues in the month before last |
NET_INNER_DUR_PRE2 | Total online time in the month before last |
NET_TIMES_PRE2 | Gsesp in the month before last |
NET_FLUX_PRE2 | Online data consumption in the month before last |
RECV_FLUX_PRE2 | Downstream in the month before last |
SEND_FLUX_PRE2 | Uploading in the month before last |
CUST_ARPU_PRE3 | General income two months ago |
ACCOUNTARPUPRE3 | Broadband revenues two months ago |
NET_INNER_DUR_PRE3 | Total online time two months ago |
NET_TIMES_PRE3 | Gsesp two months ago |
NET_FLUX_PRE3 | Online data consumption two months ago |
RECV_FLUX_PRE3 | Downstream two months ago |
SEND_FLUX_PRE3 | Uploading two months ago |
(5) Users’ DPI Data Set
Table. 5 Description of the fields in users’ DPI data set
Field Name | Field Meaning |
user_id | User ID |
keywords | Online searching keyword |
keywords_search_cnt_work | Number of searching times on weekdays |
keywords_visit_date_work | Number of online days on weekdays |
keywords_search_cnt_holid | Number of searching times on holidays |
keywords_visit_date_holid | Number of online days on holidays |
is_holiday | Whether the keyword is the holiday |
(6) Suspension and Cancellation of Mobile Service User List
Table. 6 Description of the fields in the set of suspension and cancellation of mobile service user list
Field Name | Field Meaning |
user_id | User ID |
State | Suspension and cancellation status |
Data Cleaning
After all required data has been collected and described, the next step is analysis and selection. This step plays a crucial role in the whole project since it determines the accuracy of the final results.
The collected data contains a series of disturbing factors, so it can only be analyzed after cleaning, that is, after eliminating outliers and unreasonable values.
The exact processing procedure and methods are as follows:
(1) Missing value:
Solution:
1) Missing value treatment: If the proportion of missing values is small, a random forest model is used to fill in the missing values.
2) Elimination treatment: If the missing values of a field exceed 50%, discard the field.
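The two rules above can be sketched as follows. Note the substitution: the text fills small gaps with a random forest model, whereas this sketch uses the median of the observed values as a simple stand-in so that it stays self-contained. The field names and values are hypothetical.

```python
from statistics import median

DROP_THRESHOLD = 0.5  # from the text: discard a field when more than 50% is missing


def clean_missing(columns):
    """columns maps a field name to a list of numbers, with None for missing.

    Returns (cleaned_columns, dropped_field_names).
    """
    cleaned, dropped = {}, []
    for name, values in columns.items():
        missing_ratio = sum(v is None for v in values) / len(values)
        if missing_ratio > DROP_THRESHOLD:
            dropped.append(name)
            continue
        observed = [v for v in values if v is not None]
        fill = median(observed)  # stand-in for the random forest prediction
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned, dropped


# Hypothetical fields: one with a small gap, one that is mostly empty.
columns = {
    "age": [30, None, 50, 40],
    "mostly_empty": [None, None, None, 7],
}
cleaned, dropped = clean_missing(columns)
print(cleaned)
print(dropped)
```

A model-based fill would replace the median line with a prediction trained on the rows where the field is observed; the thresholding logic stays the same.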
Note: Missing values, outliers, and unknown data should all be classified as missing values.
(2) Processing of continuous value:
From the point of view of readability, the Min-Max Normalization method is adopted in this paper. It transforms the data and maps the result values to [0, 1]. The conversion function is as follows:
$$x' = \frac{x - \min}{\max - \min}$$
Here max is the maximum value of the sample data and min is the minimum value of the sample data. The advantage of this method is that it is more tolerant and easily convertible; the drawback is that when new data is added, max and min may change and need to be redefined.
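A minimal sketch of this normalization, written so that min and max are stored after fitting, makes the drawback visible: a value that arrives later and lies outside the fitted range is mapped outside [0, 1]. The class name MinMaxNormalizer and the inner_date sample values are hypothetical; only the outlier 1406 comes from the text.

```python
class MinMaxNormalizer:
    """Min-max normalization x' = (x - min) / (max - min) with stored min and max."""

    def fit(self, values):
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def transform(self, values):
        span = self.max_ - self.min_
        if span == 0:
            # A constant field carries no information; map everything to 0.
            return [0.0 for _ in values]
        return [(v - self.min_) / span for v in values]

    def inverse_transform(self, scaled):
        span = self.max_ - self.min_
        return [s * span + self.min_ for s in scaled]


# Hypothetical inner_date values in months; 1406 is the outlier named in the text.
inner_date = [1, 12, 36, 120, 1406]
kept = [v for v in inner_date if v != 1406]  # discard the outlier before fitting

normalizer = MinMaxNormalizer().fit(kept)
scaled = normalizer.transform(kept)
print(scaled)

# New data outside the fitted range falls outside [0, 1], so min and max
# would have to be refitted, which is the drawback noted above.
print(normalizer.transform([240]))
```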
(3) Conversion processing:
For example, Fig. 8 demonstrates converting is_active_user into an integer or Boolean type to improve processing speed.
(4) Processing of normalization and discretization:
Fig. 9 shows that the maximum value of the field send_amount is 113590000, so the data of this field should be normalized.
Fig. 10 shows that the maximum value of store_balance is 44974, so outlier processing is needed, followed by normalization.
(5) Extra examples of data processing:
Table. 7 List of data processing
Processing content | English field name | Loss rate | Data set |
Type conversion and missing value | user_id | 0.0% | USER_BASE_DATA_A |
Type conversion and missing value | Offer_id | 1.7% | USER_NET_DATA_A |
Type conversion and missing value | Yw_speed_value | 1.7% | USER_NET_DATA_A |
Take the three-month mean and normalize it | Cust_arpu | | USER_NET_FLUX_DATA_A |
Take the three-month mean and normalize it | Account_arpu | | USER_NET_FLUX_DATA_A |
Take the three-month mean and normalize it | Net_inner_dur | | USER_NET_FLUX_DATA_A |
delete | New_mix_type | 49.4% | USER_MIX_DATA_A |
delete | Send_flux_pre2 | | USER_NET_FLUX_DATA_A |
delete | Cust_arpu_pre3 | | USER_NET_FLUX_DATA_A |
eliminate | Is_phs_tk | 0.0% | USER_NET_DATA_A |
eliminate | Is_holiday | | USER_DPI_DATA_A |
eliminate | Corpid | 12.5% | USER_BASE_DATA_A |
normalization | Store_balance | 0.0% | USER_NET_DATA_A |
normalization | Send_amount | 1.7% | USER_NET_DATA_A |
code | Access type | 0.0% | USER_BASE_DATA_A |
code | wllx | 0.0% | USER_BASE_DATA_A |
Data coding
Discrete data is encoded with one-hot coding (“thermal coding”), using a specific control table.
Data segmentation
Because of the large amount of data, the data is sliced by variable in practice: each single variable generates an indexed file, which is processed separately and then merged. This is somewhat similar to MapReduce, as shown in Fig. 11.
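The one-hot coding of a discrete field mentioned above can be illustrated with a short sketch. The wllx values are hypothetical; the resulting column names follow the wllx0, wllx1, wllx2 pattern that appears in the correlation analysis. The category list plays the role of the control table.

```python
def one_hot(values, categories=None):
    """Encode discrete values as 0/1 indicator rows.

    categories acts as the control table; if omitted, it is built from the data.
    """
    if categories is None:
        categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical housing-type codes for four users.
wllx = [0, 2, 1, 0]
categories, encoded = one_hot(wllx)
column_names = [f"wllx{c}" for c in categories]
print(column_names)
for row in encoded:
    print(row)
```

Passing a fixed categories list keeps the column layout identical across separately processed slices, which matters when the slices are merged afterwards.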
Data consolidation
For the cleaned data, three files are formed: train_1.txt (suspension list), train_2.txt (cancellation list), and train_ls.txt (loss list, i.e., users who meet at least one of suspension or cancellation).
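The relationship between the three lists can be sketched with sets: the loss list is the union of the suspension list and the cancellation list. The user IDs below are hypothetical, and the 0/1 label is an assumed encoding for later modeling, not one stated in the text.

```python
# Hypothetical user IDs; the real lists come from user_state_data_a.
all_users = ["u1", "u2", "u3", "u4", "u5"]
suspended = {"u1", "u4"}
cancelled = {"u2", "u4"}

# A user is counted as lost when suspended, cancelled, or both.
lost = suspended | cancelled

# Assumed label encoding for modeling: 1 = lost, 0 = retained.
labels = {user_id: int(user_id in lost) for user_id in all_users}
print(sorted(lost))
print(labels)
```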
Feature engineering
Characteristics analysis of service-suspending users
From Fig. 12, it is clear that the correlation between these variables is quite weak. Compared with the rest of the data, the correlation between inner_date and age_MinMax is relatively strong.
As in the previous chart, the correlations shown in Fig. 13 are not strong either.
Nevertheless, the relationship between income and the number of mobile phones the customer owns is extremely strong.
In Fig. 14, the correlation between ‘searching ratio on weekdays’ and ‘average number of searches per day’ is strong.
In addition, ‘total number of searches’ and ‘average number of searches per day’ show a moderate correlation.
Fig. 15 shows the relationships between different industries.
Among these industries, ‘industry 1’ and ‘industry 0’ show a strong negative correlation. The correlations between the remaining industries are so weak that they can be ignored.
Fig. 16 shows the Pearson correlation between different regions.
The correlations between regions are all negative, with values so small that they can be ignored, except that ‘region 17’ and ‘region 6’ show a weak correlation.
Fig. 17 covers star level and type of network: ‘type 5’ and ‘star_level 7’ show a weak correlation, while ‘type 0’ and ‘type 1’ show a negative correlation.
Fig. 18 shows the relationship between different development channels.
The correlation between most development channels can be ignored; however, the correlation between development channel 0 and development channels 11/9/8/6/5/3/1 is relatively strong.
Fig. 19 shows the correlations between the various Web Service Bundles.
There is a weak correlation between ‘type 2’ and ‘type 0’ and a strong negative correlation between ‘type 1’ and ‘type 0’.
Overall, the features of service-suspending customers have some internal relationships, even though service suspension is not highly correlated with any single variable. Among service-suspending customers, the following pairs are related: age / inner_date, searching ratio on weekdays / average searches per day, number of searches / average searches per day, star level 7 / type of network 5, net_flux_avg_MinMax / net_inner_dur_avg_MinMax, and net_flux_avg_MinMax / s_cnt_MinMax.
The Pearson correlation coefficients are negative and show a strong relationship for the following pairs: customer type / is_visit_other_url, pay_type / store_balance_MinMax, brand_type0 / brand_type1, brand_type2 / brand_type0, hangy_type0 / hangy_type1, latn_id6 / latn_id17, wllx0 / wllx1, wllx2 / wllx0, develop_channel_new0 / develop_channel_new1,3,5,6,8,9,11, kd_mix_type0 / kd_mix_type1, and kd_mix_type0 / kd_mix_type2.
Service cancelling characteristics analysis
In general, the features of service-cancelling users are similar to those of service-suspending users, with only slight disparities.
Characteristics of lost users
(1) Correlation of features
The feature correlations of lost users are the same as those of service-suspending users and service-cancelling users.
(2) Selection of characteristics
Random forest, AdaBoost, ExtraTree, GradientBoosting, and DecisionTree are used to obtain the best parameters of the characteristics and a precise prediction.
The predictions given by each classifier are normalized and then averaged to identify the best features: net_inner_dur, account_arpu, cust_arpu, send_amount, inner_date, net_times, net_flux, the amount deposited, and age.
Table. 8 Parameters of the selected characteristics
feature | RF S MM | ADA S MM | ET S MM | GB S MM | DT S MM | MEAN |
net_inner_dur_avg_MinMax | 0.9350 | 0.8421 | 1.0000 | 0.7525 | 0.7692 | 0.8598 |
account_arpu_avg_MinMax | 0.8769 | 0.0000 | 0.7779 | 1.0000 | 1.0000 | 0.7310 |
cust_arpu_avg_MinMax | 1.0000 | 0.0000 | 0.7601 | 0.9220 | 0.8030 | 0.6970 |
send_amount_MinMax | 0.8451 | 0.6842 | 0.4953 | 0.5298 | 0.7194 | 0.6547 |
inner_date_MinMax | 0.8980 | 0.0526 | 0.8069 | 0.6829 | 0.7347 | 0.6350 |
net_times_avg_MinMax | 0.8110 | 0.0526 | 0.7170 | 0.7407 | 0.7392 | 0.6121 |
net_flux_avg_MinMax | 0.8824 | 0.0000 | 0.7444 | 0.6364 | 0.5654 | 0.5657 |
store_balance_MinMax | 0.3732 | 1.0000 | 0.3922 | 0.3386 | 0.4099 | 0.5028 |
age_MinMax | 0.7948 | 0.0000 | 0.7197 | 0.5076 | 0.4730 | 0.4990 |
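The MEAN column of Table. 8 is consistent with a plain average of the five scaled classifier scores in each row. The sketch below shows the two steps under stated assumptions: the raw importance values fed to minmax_scale are hypothetical, while the three score rows are copied from Table. 8. The function names are illustrative.

```python
def minmax_scale(raw_scores):
    """Scale one classifier's raw importances to [0, 1] across features."""
    low, high = min(raw_scores), max(raw_scores)
    return [(s - low) / (high - low) for s in raw_scores]


def rank_features(scaled_scores):
    """Average the scaled scores per feature and sort from best to worst."""
    means = {name: sum(scores) / len(scores) for name, scores in scaled_scores.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Step 1: hypothetical raw importances from one classifier for three features.
print(minmax_scale([4.0, 1.0, 2.5]))

# Step 2: scaled scores (RF, ADA, ET, GB, DT) for three rows copied from Table. 8.
table_rows = {
    "age_MinMax": [0.7948, 0.0000, 0.7197, 0.5076, 0.4730],
    "net_inner_dur_avg_MinMax": [0.9350, 0.8421, 1.0000, 0.7525, 0.7692],
    "store_balance_MinMax": [0.3732, 1.0000, 0.3922, 0.3386, 0.4099],
}
ranking = rank_features(table_rows)
for name, mean in ranking:
    print(f"{name:<26} {mean:.4f}")
```

Averaging after scaling keeps one classifier with large raw importances from dominating the ranking, although a classifier that assigns zero to most features (as AdaBoost does here) still pulls those features down.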
Data Modeling
Currently, we use the GaussianNB, LogisticRegression, DecisionTreeClassifier, ExtraTreesClassifier, AdaBoostClassifier, MLPClassifier, and RandomForestClassifier classifiers. The next step is to determine which classifier is most suitable for this invention. This invention mainly uses a Bagging combined algorithm to find the best classification method.
We analyze the ROC of each classifier and of the Bagging combined algorithm and find that the Extra Trees Classifier has the worst performance, while the combined algorithm reaches AUC=0.78, which needs to be optimized.
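To make the comparison concrete, the sketch below computes the AUC of two classifiers and of their averaged scores. It is a simplified stand-in for the combination step, not the Bagging procedure itself: the labels and scores are hypothetical, and the combination is a plain average of per-user scores.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_scores(score_lists):
    """Average the scores of several classifiers user by user."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


# Hypothetical churn labels (1 = lost) and predicted scores from two classifiers.
labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]
scores_b = [0.6, 0.8, 0.3, 0.2, 0.4, 0.5]
combined = average_scores([scores_a, scores_b])

print("classifier A:", round(roc_auc(labels, scores_a), 3))
print("classifier B:", round(roc_auc(labels, scores_b), 3))
print("combined    :", round(roc_auc(labels, combined), 3))
```

In this toy data the two classifiers make different mistakes, so the averaged scores rank users better than either one alone; with real data the gain depends on how correlated the classifiers' errors are.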
Model Optimizing
The model built previously achieves AUC=0.78 and still needs to be optimized. The main optimization ideas are as follows:
1) In terms of features, many features are not normally distributed, and we plan to apply a logarithmic transform to them.
2) In terms of model processing, the model is still under-fitted at present, and we hope to add cross validation to find the best parameters.
3) In the combination mode, the Extra Trees Classifier, which has the worst effect, is eliminated.
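The first two ideas can be sketched as follows. Both helpers are illustrations under stated assumptions: log1p is chosen here (rather than a plain logarithm) so that zero values stay defined, the send_amount sample is hypothetical apart from the maximum 113590000 quoted earlier, and the fold split is a simple contiguous k-fold scheme.

```python
import math


def log_transform(values):
    """Compress a long-tailed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, validation
        start += size


# Hypothetical send_amount values; the largest one is the maximum from the text.
send_amount = [0, 1200, 56000, 113590000]
transformed = log_transform(send_amount)
print([round(t, 2) for t in transformed])

# Each candidate parameter setting would be scored on every validation fold.
folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), validation)
```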
Optimization of lost users
After taking the logarithm of the original data, we start to train the model. After the first training, the combination model reached AUC=0.79, and we found that the naive Bayes and logistic classifiers were not very effective. In the second training, these two classifiers were eliminated, and the final combined model reached AUC=0.823.
Optimization for service suspending users
Similar to the previous process, we take the logarithm of the original data and then train the model. After the first training, the combination model reached AUC=0.832, and we found that the naive Bayes and logistic classifiers were not effective. In the second training, the combination of Decision Tree, AdaBoost, and Random Forest, which had the highest scores in the previous training, was adopted, and the final AUC was 0.863.
Optimization of service cancelling users
Similar to the previous process, we take the logarithm of the original data and then train the model. After the first training, the combination model reached AUC=0.739, and we found that the naive Bayes, logistic, and mlp-nn classifiers had relatively poor effects. The second training employed a combination of DecisionTree and AdaBoost, giving a combined model AUC=0.783, only slightly better than the AdaBoost classifier alone (AUC=0.780), so we finally decided to use the AdaBoost classifier.
The Future Plan
The results basically meet the requirements. We hope to optimize our model further in the future.
For some features, we plan to work with the business side to study whether other feature indicators can be derived, such as binding expiration time and recent revenue trend.
In the combination mode, the Extra Trees Classifier, which currently has the worst effect, is eliminated; in addition, Stacking integration will be attempted.
In terms of model processing, the model currently appears to be under-fitted, and we plan to add cross validation to find the best parameters.
Next, we plan to export the model and make it into a small tool, which can be embedded into the existing business App or web page to provide auxiliary reference for business promotion in the territory.
Claims (4)
1. A statistical analysis method of mobile telecom data driven user loss prediction, wherein millions of records of users’ information and behavior in the last three months are obtained from China Telecom to analyze the relationships between each characteristic and the loss of users; and this model can also be used in future user loss prediction.
2. The method according to claim 1, wherein many data collection and processing tools are used, including missing data completion and elimination, data normalization, conversion processing, data re-coding, data segmentation, and data consolidation.
3. The method according to claim 1, wherein a bagging combined algorithm is mainly used to find the best classification method.
4. The method according to claim 1, wherein an analysis method and a powerful theoretical and policy basis are provided for China Telecom to update big data in the broadband market competition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101198A AU2019101198A4 (en) | 2019-10-03 | 2019-10-03 | A statistical analysis method of mobile telecom data driven user loss prediction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101198A AU2019101198A4 (en) | 2019-10-03 | 2019-10-03 | A statistical analysis method of mobile telecom data driven user loss prediction |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2019101198A4 true AU2019101198A4 (en) | 2020-01-16 |
Family
ID=69146757
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2019101198A Ceased AU2019101198A4 (en) | 2019-10-03 | 2019-10-03 | A statistical analysis method of mobile telecom data driven user loss prediction |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2019101198A4 (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |