CN102402566A

CN102402566A - Web user behavior analysis method based on Chinese webpage automatic classification technology

Info

Publication number: CN102402566A
Application number: CN2011102278003A
Authority: CN
Inventors: 孙建; 张梅琴; 张顺颐; 王攀
Original assignee: JIANGSU XINWANG TEC TECHNOLOGY CO LTD
Current assignee: JIANGSU XINWANG TEC TECHNOLOGY CO LTD
Priority date: 2011-08-09
Filing date: 2011-08-09
Publication date: 2012-04-04

Abstract

The invention provides a web user behavior analysis method based on automatic Chinese webpage classification. The method uses naive Bayes classification: it infers the category of each webpage a web user browses from the category probabilities and the joint distribution probability of the feature items, and then analyzes the user's browsing habits on the basis of the webpage categories to obtain a user behavior analysis result. The key technique of the invention is a dynamic training set that is updated automatically according to a classification accuracy index, which keeps the training set timely and representative. The method is divided into four modules: a data processing module, a feature extraction module, a webpage classification module and a user behavior analysis module. The data processing module acquires the basic information of a user and the source code of the webpages the user browses, and extracts the Chinese text from that source code. The feature extraction module screens out the feature items that describe webpage category characteristics and expresses them in vector form.

Description

Web user behavior analysis method based on the Chinese web page automatic classification technology

Technical field

The invention provides a web user behavior analysis method based on automatic Chinese webpage classification. It uses the naive Bayes classification method: the category of each webpage a web user browses is inferred automatically from the category probabilities and the joint distribution probability of the feature items, and the user's browsing habits are then analyzed on the basis of the webpage classification to obtain a user behavior analysis result. The key technique of the invention is a dynamic training set that is updated automatically according to a classification accuracy index, which makes the training set more timely and representative. The method relates to fields such as artificial intelligence, user behavior analysis, webpage classification and network management.

Background of invention

The rapid development of the Internet has brought a sharp increase in the number of users, and users' demands on the network keep rising. Analyzing the composition of user groups and their habits and preferences, so as to offer users more personalized services, has become an important research direction. In addition, as services diversify, research on the Internet and on the behavior of its users is also an important basis for network planning, design and management.

When data is collected for analyzing user behavior, the URLs of the websites a user visits can be obtained, but the categories these URLs belong to are not known. Each URL therefore has to be mapped to a concrete semantic category (such as sports, finance or military). Once a complete, accurate and dynamic automatic webpage classification system is in place, the category of a page can be obtained from its URL. Knowing the categories of the visited sites makes it possible to analyze web traffic in depth, mine users' network behavior, and learn their behavioral habits and preference trends, which provides a basis for personalized services.

Summary of the invention:

Technical problem: the invention provides a web user behavior analysis method based on automatic Chinese webpage classification. It uses the naive Bayes classification method: the category of each webpage a web user browses is inferred automatically from the category probabilities and the joint distribution probability of the feature items, and the user's browsing habits are then analyzed on the basis of the webpage classification to obtain a user behavior analysis result. The key technique of the invention is a dynamic training set. An index for evaluating classification accuracy and a threshold are defined; after each classification, the accuracy index of the classification result is computed, and if it is greater than the threshold the training set is updated automatically by adding the webpage vector of the classified page to the corresponding category of the training set. Compared with the fixed training sets used previously, a dynamic training set is more timely and representative, which makes the classification results more accurate.

Technical solution: the invention proposes a web user behavior analysis method based on automatic Chinese webpage classification, implemented in the following steps:

(1) Data acquisition. Information is collected as required, mainly the basic information of the web user and the URLs of the webpages the user browses.

(2) Webpage source code extraction. The source code of the webpage is obtained from its URL, and HTML tags, images, client-side scripts and similar content are removed so that only plain Chinese text remains.

(3) Word segmentation. The bidirectional maximum matching method is used: by matching against the terms of a Chinese dictionary, the content of the Chinese web text is cut into a set of terms.

(4) Keyword screening. Keyword screening consists of two steps: screening by term length and removing duplicate keywords. First, term length is restricted to between 2 and 4 characters; terms outside this range contribute little to classification or even interfere with it, so they are rejected. Then, a term that occurs repeatedly in a text is recorded only once together with its word frequency, which improves computation speed and reduces counting errors.

(5) Feature item determination. The relationship between the Chinese keywords of a webpage and its category follows a χ² distribution, so the χ² statistical method is used to determine the feature items. The frequency of each keyword in each category is computed first, the statistic is then computed with the χ² formula, and finally the 1000 keywords with the largest statistic are selected as feature items.

(6) Webpage vector representation. The selected feature items and their word frequencies are recorded and represented as a vector. The elements of the webpage vector are the feature items, and the value of each element is the word frequency of that feature item in the text of the webpage.

(7) Webpage classification with the naive Bayes method. With the category probability as the prior probability and the joint distribution probability of the feature items as the conditional probability, the posterior probability is obtained from Bayes' theorem. The category with the maximum posterior probability is selected as the category of the webpage being classified.

(8) Training set update. An index for evaluating the accuracy of the classification result and a threshold are defined. After each classification, the accuracy index of the classification result is computed; if it is greater than the threshold, the training set is updated by adding the webpage vector of the classified page to the corresponding category of the training set. Otherwise the original training set is left unchanged.

(9) Web user behavior analysis. Different query conditions are constructed. Combining the user's basic information with the category information of the browsed webpages gives the distribution of the types of webpage browsed by users under different conditions. From this information the behavioral habits and preference trends of web users can be derived, which helps to provide more personalized services.

Beneficial effect

The web user behavior analysis method based on automatic Chinese webpage classification achieves the following:

(1) The training set is updated automatically according to the classification accuracy index. Compared with the fixed training sets used previously, the dynamic training set is more timely and representative.

(2) With the training set updated in real time, automatic webpage classification by the naive Bayes method gives more accurate results.

(3) Based on the webpage classification results and the user's basic information, the behavior of web users can be mined and analyzed in greater depth, so that the analysis results come closer to the users' actual behavioral habits and preferences.

Description of drawings

Fig. 1 is a block diagram of the modules of the invention.

Embodiment

The technical solution of the invention is described in detail below with reference to the accompanying drawing:

The invention provides a web user behavior analysis method based on automatic Chinese webpage classification. It uses the naive Bayes classification method: the category of each webpage a web user browses is inferred automatically from the category probabilities and the joint distribution probability of the feature items, and the user's browsing habits are then analyzed on the basis of the webpage classification to obtain a user behavior analysis result. The concrete steps are as follows:

(1) Data acquisition. Information is collected as required, mainly the basic information of the web user and the URLs of the webpages the user browses. The user's basic information comprises the user's IP address, home location, time of browsing, number of IP packet bytes received, number of IP packet bytes sent, number of IP packets received and number of IP packets sent.

(2) Webpage source code extraction. Extracting the webpage source code means locating the web text through its URL and reading its content. The web text contains a large number of HTML tags, images and client-side scripts alongside the text itself, so it is preprocessed during extraction: the HTML tags, images and client-side scripts are removed, and only the plain Chinese text is kept.
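The preprocessing in step (2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `extract_chinese_text` and `_TextCollector` are hypothetical, "Chinese text" is taken to mean characters in the CJK Unified Ideographs range U+4E00 to U+9FFF, and fetching the page from its URL is left out.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


# Assumption: "Chinese" means the CJK Unified Ideographs block U+4E00..U+9FFF.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def extract_chinese_text(html_source):
    """Return the Chinese runs of a page, separated by single spaces."""
    collector = _TextCollector()
    collector.feed(html_source)
    collector.close()
    text = " ".join(collector.parts)
    # Every run of non-Chinese characters collapses to one space.
    return _NON_CHINESE.sub(" ", text).strip()
```

Tags and images disappear because only text nodes are collected; script bodies are dropped explicitly, since the parser would otherwise report them as text.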

(3) Word segmentation. Written Chinese has no explicit separator between words, so the individual terms in the character stream have to be separated. With the support of a Chinese dictionary, the content of the Chinese web text is cut into a vector of terms by matching against the dictionary entries. The main idea is as follows:

1) Pre-segmentation. Non-Chinese symbols such as punctuation, digits and English letters are used to cut the sentence into several Chinese character strings;

2) The bidirectional maximum matching method, which combines forward maximum matching (MM) and reverse maximum matching (RMM), is used as the basic segmentation method. Both directions use character-by-character maximum matching: the cut point starts at the head of the sentence and advances step by step until the Chinese character sequence still to be segmented is empty. The result of each cut is the longest string that matches successfully.

The steps of the bidirectional maximum matching method are as follows:

1. Take the first 6 Chinese characters of the current Chinese character sequence of the sentence as the matching field and look it up in the dictionary. If the dictionary contains such a term, the match succeeds: cut the matching field out of the current Chinese character sequence as a word, put it into the term set, and continue with step 1; otherwise go to step 2;

2. Remove one Chinese character from the tail of the matching field to form a new matching field and match it against the dictionary again. If the match succeeds, cut the new matching field out of the current Chinese character sequence as a word and put it into the term set; otherwise repeat step 2. If even the last remaining single character fails to match the dictionary, cut that character out of the current character sequence and put it into the term set;

3. If the current Chinese character sequence of the text is not empty, go to step 1; otherwise finish.
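The segmentation steps above can be sketched in Python as follows. The function names are hypothetical, and the dictionary is assumed to be a plain set of strings. The text spells out only the forward pass; `reverse_max_match` is an assumed mirror image that shortens the field from the head instead of the tail. How the forward and reverse results are reconciled is not specified, so no combination rule is shown.

```python
import re

MAX_WORD_LEN = 6  # the steps above start from a 6-character matching field

_CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def pre_segment(sentence):
    """Split on punctuation, digits, Latin letters, etc.; keep Chinese runs."""
    return _CHINESE_RUN.findall(sentence)


def forward_max_match(chars, dictionary, max_len=MAX_WORD_LEN):
    """Forward maximum matching (MM) over one Chinese character string."""
    terms = []
    pos = 0
    while pos < len(chars):
        # Try the longest field first, then drop one tail character at a time.
        for size in range(min(max_len, len(chars) - pos), 0, -1):
            field = chars[pos:pos + size]
            # A single character is emitted even when the dictionary lacks it.
            if size == 1 or field in dictionary:
                terms.append(field)
                pos += size
                break
    return terms


def reverse_max_match(chars, dictionary, max_len=MAX_WORD_LEN):
    """Reverse maximum matching (RMM): assumed mirror image of the above."""
    terms = []
    end = len(chars)
    while end > 0:
        # Try the longest field ending at `end`, then drop head characters.
        for size in range(min(max_len, end), 0, -1):
            field = chars[end - size:end]
            if size == 1 or field in dictionary:
                terms.append(field)
                end -= size
                break
    terms.reverse()
    return terms
```

With the toy dictionary `{"研究", "研究生", "生命", "起源"}`, the string `研究生命起源` shows why both directions are run: the forward pass greedily takes `研究生` and strands `命`, while the reverse pass finds `研究`, `生命`, `起源`.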

(4) Keyword screening. Terms of unsuitable length and repeated terms are removed. The concrete steps are as follows:

1. Term length screening. The length of every term is restricted to between 2 and 4 characters; terms outside this range are considered to contribute little to classification or even to interfere with it, and are rejected;

2. Term uniqueness. A term that occurs repeatedly in a text is recorded only once, together with its word frequency. Every term thus appears only once in the overall vocabulary, which improves computation speed and reduces counting errors.
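A minimal sketch of the two screening steps, assuming the segmenter returns a list of strings; the name `screen_keywords` is hypothetical:

```python
from collections import Counter


def screen_keywords(terms, min_len=2, max_len=4):
    """Keep terms of 2 to 4 characters; record each distinct term once."""
    kept = [t for t in terms if min_len <= len(t) <= max_len]
    # Counter stores every distinct term a single time with its word frequency.
    return Counter(kept)
```

For example, `screen_keywords(["体育", "的", "体育", "财经"])` drops the one-character term and records `体育` once with frequency 2.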

(5) Feature item determination. The relationship between the Chinese keywords of a webpage and its category follows a χ² distribution, so the χ² statistical method is used to determine the feature items. The higher this statistic, the less independent the keyword is of the category and the stronger their correlation, i.e. the more the keyword contributes to that category. The χ² statistic is computed from the 2×2 contingency table of keyword and category as follows:

\chi^{2} = \frac{N \times (N_{ij} \times N_{i'j'} - N_{ij'} \times N_{i'j})^{2}}{(N_{ij} + N_{ij'}) \times (N_{i'j} + N_{i'j'}) \times (N_{ij} + N_{i'j}) \times (N_{ij'} + N_{i'j'})}

where N_{ij} is the frequency with which keyword i occurs in category j, N_{ij'} is the frequency with which keyword i occurs in categories other than j, N_{i'j} is the frequency with which all terms other than keyword i occur in category j, N_{i'j'} is the frequency with which all terms other than keyword i occur in categories other than j, and N is the sum of the frequencies of all keywords.

The concrete steps for determining the feature items are as follows:

1. Compute the frequency with which each keyword occurs in each category, then sum all the frequencies;

2. Compute the four relationship frequencies between each keyword and each category, then use the χ² formula to obtain the χ² statistic of each keyword i for category j;

3. Sort all χ² statistics in descending order and take the top 1000 keywords as feature items, which completes the determination of the feature items;
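The selection can be sketched as below. The sketch uses the usual 2×2 contingency-table form of the χ² statistic, with marginal sums in the denominator. Two points are assumptions rather than statements of the text: each keyword is ranked by its largest χ² value over all categories, since the text does not say how the per-category values are merged into one ranking, and the input layout (a mapping from category to term frequencies) and function names are hypothetical.

```python
from collections import Counter


def chi_square(n_ij, n_ij_other, n_other_j, n_other_other):
    """Chi-square statistic of one keyword/category 2x2 frequency table."""
    n = n_ij + n_ij_other + n_other_j + n_other_other
    denom = ((n_ij + n_ij_other) * (n_other_j + n_other_other)
             * (n_ij + n_other_j) * (n_ij_other + n_other_other))
    if denom == 0:
        return 0.0
    return n * (n_ij * n_other_other - n_ij_other * n_other_j) ** 2 / denom


def select_features(class_term_freq, top_k=1000):
    """class_term_freq: {category: Counter(term -> frequency in category)}."""
    class_totals = {cat: sum(tf.values()) for cat, tf in class_term_freq.items()}
    term_totals = Counter()
    for tf in class_term_freq.values():
        term_totals.update(tf)
    grand_total = sum(class_totals.values())

    best = {}
    for term, term_total in term_totals.items():
        for cat, tf in class_term_freq.items():
            n_ij = tf.get(term, 0)                  # keyword i in category j
            n_ij_other = term_total - n_ij          # keyword i elsewhere
            n_other_j = class_totals[cat] - n_ij    # other terms in category j
            n_other_other = grand_total - n_ij - n_ij_other - n_other_j
            score = chi_square(n_ij, n_ij_other, n_other_j, n_other_other)
            # Assumption: rank a keyword by its best score over categories.
            best[term] = max(best.get(term, 0.0), score)
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:top_k]
```

A term spread evenly over the categories scores 0 and falls to the bottom of the ranking, while a term concentrated in one category scores high.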

(7) Webpage classification with the naive Bayes method. The naive Bayes classification method infers the category of a document from the category probabilities and the joint distribution probability of the feature items. With the category probability as the prior probability and the joint distribution probability of the feature items as the conditional probability, the posterior probability is obtained from Bayes' theorem. The category with the maximum posterior probability is selected as the category of the webpage being classified.

The principle and steps of the naive Bayes classification method are described below.

Let C = {c_1, c_2, ..., c_k} be the set of categories and D = {d_1, d_2, ..., d_n} the training set, where n_j is the number of documents in the training set that belong to category c_j, n is the total number of training samples, and d is the document to be classified. Using the Laplace probability estimate, the prior probability P(c_j) of category c_j is:

P(c_{j}) = \frac{1 + n_{j}}{n + n_{j}}, \quad j = 1, 2, \ldots, k

The document d to be classified consists of the feature items it contains, i.e. d = (w_1, w_2, ..., w_m). The Laplace probability estimate is used to compute the probability that feature item w_i belongs to category c_j, and under the feature independence assumption the conditional probability P(d|c_j) is:

P(d \mid c_{j}) = \prod_{i=1}^{m} P(w_{i} \mid c_{j}) = \prod_{i=1}^{m} \frac{1 + TF(w_{i} \mid c_{j})}{V + \sum_{s=1}^{V} TF(w_{s} \mid c_{j})}

where TF(w_i|c_j) is the total word frequency of feature item w_i over all documents of category c_j, and V is the total number of feature items in the webpage vector.

Because P(d) is the same constant for every category, by Bayes' theorem selecting the category with the maximum posterior probability amounts to selecting the category that maximizes the product P(c_j)P(d|c_j).

The concrete steps are as follows:

1. Count the number of documents in the training set that belong to each category, and the total number of training samples;

2. Compute the prior probabilities with the prior probability formula;

3. Count the feature items in the text to be classified, and compute the word frequency of each feature item in each category;

4. Compute the conditional probability for each category with the conditional probability formula;

5. For each category, compute the product of the prior probability and the conditional probability;

6. Select the category with the largest product of prior probability and conditional probability as the category of the document to be classified.
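The six steps can be sketched as follows, using the prior and conditional probability formulas as printed in this section. Working in log space is an implementation choice of this sketch to avoid floating-point underflow when many feature items are multiplied; it does not change which category wins. The data layout (webpage vectors as dictionaries, the feature items as a set) and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(training_set, features):
    """training_set: list of (term -> frequency dict, category) pairs."""
    doc_counts = Counter(cat for _, cat in training_set)  # n_j per category
    term_freq = defaultdict(Counter)                      # TF(w | c_j)
    for vector, cat in training_set:
        for term, freq in vector.items():
            if term in features:
                term_freq[cat][term] += freq
    return doc_counts, term_freq


def log_scores(doc_vector, doc_counts, term_freq, features):
    """log(P(c_j) * P(d | c_j)) for every category c_j."""
    n = sum(doc_counts.values())  # total number of training samples
    v = len(features)             # V: number of feature items
    scores = {}
    for cat, n_j in doc_counts.items():
        log_prior = math.log((1 + n_j) / (n + n_j))  # prior as printed above
        total_tf = sum(term_freq[cat].values())
        log_cond = sum(
            math.log((1 + term_freq[cat][w]) / (v + total_tf))
            for w in doc_vector if w in features
        )
        scores[cat] = log_prior + log_cond
    return scores


def classify(doc_vector, doc_counts, term_freq, features):
    """Return the category with the largest prior-times-conditional product."""
    scores = log_scores(doc_vector, doc_counts, term_freq, features)
    return max(scores, key=scores.get)
```

Terms of the document that are not feature items are ignored, which matches the definition of d as the tuple of feature items it contains.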

(8) Training set update. First an index for evaluating classification accuracy and a threshold are defined. The accuracy index ES is computed as follows:

ES = \sqrt{\sum_{i=1, i \neq s}^{k} \left( P(c_{i}) P(d \mid c_{i}) - P(c_{s}) P(d \mid c_{s}) \right)^{2}}

where P(c_s)P(d|c_s) is the product of the prior probability and the conditional probability for the category c_s assigned to the document being classified, and P(c_i)P(d|c_i) is the same product for each of the other categories.

The threshold Threshold is computed as follows:

Threshold = \frac{1}{n} \sum_{i=1}^{k} n_{i} P(c_{i}) P(d \mid c_{i})

where n is the total number of training samples, n_i is the number of documents in the training set that belong to category c_i, and P(c_i)P(d|c_i) is the product of the prior probability and the conditional probability for each category.

After each classification, the accuracy index of the classification result is computed with the formula above. If the accuracy index ES of the classification result is greater than the threshold Threshold, the training set is updated by adding the webpage vector of the classified page, together with its assigned category, to the training set. Otherwise the original training set is left unchanged.
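A sketch of the update rule under stated assumptions: `products` is a hypothetical mapping from each category to its product P(c_i)P(d|c_i) for the page just classified, expressed as plain probabilities rather than logarithms, and `doc_counts` maps each category to its current number of training documents. The function names are illustrative only.

```python
import math


def accuracy_index(products, assigned):
    """ES: Euclidean distance between the winning product and all others."""
    p_s = products[assigned]
    return math.sqrt(sum((p - p_s) ** 2
                         for cat, p in products.items() if cat != assigned))


def update_threshold(products, doc_counts):
    """Threshold: products weighted by each category's share of documents."""
    n = sum(doc_counts.values())
    return sum(doc_counts[cat] * p for cat, p in products.items()) / n


def maybe_update(training_set, vector, products, doc_counts):
    """Append (vector, category) to the training set when ES > Threshold."""
    assigned = max(products, key=products.get)
    if accuracy_index(products, assigned) > update_threshold(products, doc_counts):
        training_set.append((vector, assigned))
        return True
    return False
```

ES is large when the winning category stands well clear of the others, so only confidently classified pages are fed back into the training set.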

(9) Web user behavior analysis. Different query conditions are constructed, and the user's behavioral habits are analyzed by combining the user's basic information with the category information of the browsed webpages. The user's basic information comprises the user's IP address, home location, time of browsing, number of IP packet bytes received, number of IP packet bytes sent, number of IP packets received and number of IP packets sent. Adding the webpage category gives 8 independent conditions. By the principle of permutation and combination, combining these 8 independent conditions yields 2⁸ compound conditions. After the conditions with no practical meaning are removed, 26 conditions of practical value remain. Querying by these 26 conditions gives the distribution of the types of webpage browsed by users under different conditions. From this information the behavioral habits and preference trends of web users can be derived, which helps to provide more personalized services.
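The counting argument and one possible query can be sketched as follows. The field names in `CONDITIONS` and the record layout are hypothetical. Enumerating every subset of the 8 independent conditions, including the empty one, gives the 2⁸ = 256 compound conditions; which 26 of them are kept is not specified in the text, so no selection is shown.

```python
from collections import Counter
from itertools import combinations

# Hypothetical field names for the 8 independent conditions.
CONDITIONS = (
    "ip", "home_location", "time", "bytes_received",
    "bytes_sent", "packets_received", "packets_sent", "category",
)


def compound_conditions(fields=CONDITIONS):
    """All subsets of the independent conditions, including the empty one."""
    return [combo
            for r in range(len(fields) + 1)
            for combo in combinations(fields, r)]


def category_distribution(records, **filters):
    """Count page categories among records that match every filter."""
    return Counter(
        rec["category"] for rec in records
        if all(rec.get(key) == value for key, value in filters.items())
    )
```

For instance, `category_distribution(records, home_location="Nanjing")` would give the category distribution of the pages browsed from one home location, assuming each record is a dictionary with those keys.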

Claims

1. A web user behavior analysis method based on automatic Chinese webpage classification, characterized in that the method comprises the following steps:

(5) Feature item determination. The relationship between the Chinese keywords of a webpage and its category follows a χ² distribution, so the χ² statistical method is used to determine the feature items. The frequency of each keyword in each category is computed first, the statistic is then computed with the χ² formula, and finally the 1000 keywords with the largest statistic are selected as feature items.