CN106339502A - Modeling recommendation method based on user behavior data fragmentation cluster

Info

Publication number: CN106339502A
Application number: CN201610828355.9A
Authority: CN
Inventors: 陆鑫; 邓玉林
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2016-09-18
Filing date: 2016-09-18
Publication date: 2017-01-18

Abstract

The invention relates to Internet personalized recommendation technology, and in particular to a modeling recommendation method based on fragmentation and clustering of user behavior data. The method fragments and clusters user behavior data to build a dynamic user interest model, on which personalized recommendation is then based. Existing personalized recommendation methods consider only the time-varying nature of user interest; this method also mines multi-dimensional, discrete interest points from the behavior data, so that the user interest model is described more accurately. For the multi-dimensional, discrete interest topics of a target user, the method produces preliminary recommendations for the user's interest points concurrently, and then combines the weight and memory degree of each interest point with the predicted score of its preliminary recommendation result to produce the final recommendation, improving both the accuracy and the processing performance of personalized recommendation.

Description

A modeling recommendation method based on fragmentation and clustering of user behavior data

Technical field

The present invention relates to Internet technology, and in particular to a modeling recommendation method based on fragmentation and clustering of user behavior data.

Background technology

With the growth of Internet applications, information overload has become an increasingly prominent problem. Finding the information one is interested in among massive amounts of information is extremely difficult for a user. Personalized recommendation technology analyzes large volumes of user behavior data to mine users' interest preferences and potential needs, and a personalized recommendation system then recommends services, goods, or content that the user is likely to be interested in. Personalized recommendation is now widely used in e-commerce, social networks, location-based services, search services, advertising services, and other fields. The best-known examples are e-commerce platforms such as Amazon, Taobao, and JD.com, whose recommendation systems have increased sales by about 20% to 30% and brought considerable economic benefit. Search engines are a common information retrieval tool in daily life, and a user's use of a search engine also reveals the content topics the user is interested in. Feeding the interest topics and access behavior captured by the search engine into the recommendation system allows the user interest model to be described more accurately.

Analysis of user behavior data on e-commerce platforms shows that a user's interests are not fixed; they change with time and place and exhibit dynamic characteristics such as time variation, multi-dimensionality, and discreteness. For example, when traveling, a user is interested in local transport, hotels, dining, and local customs; at work, the user is interested in information about his or her industry; during leisure time, the user is interested in entertainment such as film and television, music, news, and sports. These interests also drift over time, which reflects the dynamic change of interest points, and such shifts are usually discrete to some degree: a user who enjoys thrilling, adventurous recreational and sports activities when young may prefer relaxing, gentle leisure activities after reaching middle age. Therefore, when deriving a user's areas of interest from behavior data, the time variation, multi-dimensionality, and discreteness of user interest must all be taken into account in order to describe the user's current interest model precisely.

Existing dynamic user interest modeling methods fall into two main groups: sliding-window methods and time-parameter methods. A sliding-window method sets a time window of fixed size that moves forward as time passes. Only the data inside the current window is considered when mining user interest; data that has fallen outside the window is treated as past interest and ignored. This approach is easy to implement, but the window size is hard to set because it differs from user to user, and discarding the data outside the window means that the user's broader interests are missed. Time-parameter methods come in many variants, the most representative of which is based on the forgetting curve: the user's historical ratings are adjusted by the forgetting curve, that is, assigned corresponding weights, and a rating is discarded once its adjusted value falls below a threshold. Time-parameter methods rest on the assumption that, when building a user interest model, recent behavior data is more important than historical behavior data because it better reflects the user's current interest; the older the data, the less it reflects the current interest.

These dynamic interest modeling methods are generally suitable only when interest changes gradually. They are poorly suited to jumps in user interest, where the user moves between two interest points that are far apart. In a large comprehensive e-commerce platform with a search engine in particular, user interest can change greatly with time, place, intention, and other factors, and shows a certain discreteness. A time parameter alone cannot accurately describe the dynamic change of the user interest model.

In summary, when a user accesses a large comprehensive e-commerce platform, the user's interest points are dynamically time-varying, multi-dimensional, and discrete. Dynamic time variation means that the user's interest topics change over time; for example, a user may be interested in work-related information during the day and in lifestyle and entertainment in the evening. Multi-dimensionality means that the user has several different preferences in different areas; for example, a user may like several different subjects in study and several different activities in leisure. Discreteness means that the user's interest topics are far apart from one another; for example, a user's interests in travel and in work are discrete to some degree. Traditional personalized recommendation methods generally use a collaborative filtering algorithm, whose principle is to find users whose behavior is similar to that of the target user and recommend the goods those users like to the target user. However, interest topics differ considerably between users and are complex, which makes it difficult to compute the similarity of interest topics between users. Suppose users a, b, and c all have work interests and life interests. The work-interest similarity of a to b and of a to c is 50% and 10% respectively, and the life-interest similarity is 50% and 85% respectively. One clearly cannot simply average these values and conclude that a is more similar to b than to c. Traditional personalized recommendation methods therefore do not fully account for the multi-dimensionality and discreteness of user interest, and cannot adequately solve the problem of recommendation precision under dynamically changing user interest. In addition, users of an e-commerce platform usually obtain the information they need through the platform's search engine. If the recommendation system does not cluster the search keywords and the associated browsing data, it is hard to narrow down the interest points in the user behavior data, which hampers the processing performance and precision of the recommendation system.

Summary of the invention

To address the technical limitations of existing personalized recommendation methods in scenarios where user interest is dynamic, the present invention proposes a modeling recommendation method based on fragmentation and clustering of user behavior data.

The technical solution of the present invention is a modeling recommendation method based on fragmentation and clustering of user behavior data, characterized by comprising:

a. Customized processing of user behavior data, specifically comprising:

a1. User behavior data collection. The user behavior data is the data collected by the e-commerce platform when users access it over the Internet. It includes at least the login, search, browse, purchase, and evaluation types of data, and every behavior record carries the basic attribute information assigned by the e-commerce platform, which includes at least session id, user id, behavior type, behavior content, user IP, login device, and time;

a2. User behavior data fragmentation. The behavior data collected in step a1 is organized by user; then, taking each transaction session between a user and the e-commerce platform as the unit, the user's behavior data is divided by transaction session so that each resulting behavior data fragment contains only one transaction topic, and the user's behavior data fragments that contain similar topic words are merged. A transaction session is the time span that is created when the user logs in to the e-commerce platform and destroyed when the user ends the visit;

b. Building a public user interest model by cluster analysis of the user behavior data, specifically comprising:

b1. After the behavior data of the different users has been fragmented in step a, each behavior data fragment is treated as one category and the similarity between all categories is calculated. Specifically, given two behavior data fragments u_i and u_j, the similarity s(u_i, u_j) of their topic word sets is calculated according to Formula 1,

where s(u_i, u_j) is the similarity between behavior data fragments u_i and u_j, and v(u_i) and v(u_j) are the topic word sets of u_i and u_j respectively. When the intersection of the topic word sets is computed, two topic words are considered identical only if the words are the same and have the same part of speech;

b2. The two categories with the highest similarity are merged into one category, and the average of the two categories' similarities is used as the similarity of the new category; step b2 is repeated until the specified number of categories is obtained;

b3. Topic words are extracted from each category finally obtained in step b2 as interest topics, and the public user interest model is built from them;

c. The e-commerce platform makes recommendations to the user, specifically as follows:

The e-commerce platform analyzes the behavior data of the target user, finds each interest point of the target user in the public user interest model obtained in step b, and uses a collaborative filtering algorithm to make a preliminary recommendation for each interest point separately. The final recommendation then combines the weight and memory degree of each of the target user's interest points with the predicted score of the preliminary recommendation result. Let λ_i be the weight of the i-th interest point in the target user's interests. The weight is computed as follows: let s_i be the i-th interest point of the target user and len(s_i) the number of the target user's behavior records contained in s_i; the weight λ_i of interest point s_i in the target user's interests is then given by Formula 2.

In accordance with the way users forget interest points, the forgetting function h(t) is applied to the weight λ_i of each interest point. Let t be the interval between the time of the last behavior record in a given interest point and the time of recommendation; the user's memory degree for that interest point is then given by Formula 3:

h(t) = e^{-t}    (Formula 3)

where t is measured in months. When the last behavior record occurs at the time of recommendation, the interval is 0 and h(0) = 1, meaning that the user has not yet begun to forget this interest. Finally, the preliminary recommendation results of the user's interest points are weighted and sorted to obtain the sorted interest point list P. Suppose the target user has n interest points and the predicted score of the recommendation result for the i-th interest point is p_i; the sorted interest point list P is then given by Formula 4:

P=sort (p₁*λ₁*h(t₁),p₂*λ₂*h(t₂),p₃*λ₃*h(t₃),…,p_i*λ_i*h(t_i),…,p_n*λ_n*h(t_n)) (formula 4)

Here the sort() function weights the predicted score of the preliminary recommendation result, the interest weight, and the memory degree of each of the target user's interest points, and sorts the resulting values. p_i is the predicted score of the preliminary recommendation result for the user's interest point i, t_i is the interval from the last behavior record in that interest point to the time of recommendation, and h(t_i) is the user's memory degree for interest point i. Finally, according to the computed values in the sorted interest point list, the recommendation results of the interest points with the highest values are provided to the target user, achieving personalized recommendation that takes into account the time-varying, multi-dimensional, and discrete nature of the user's dynamic interests.
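The weighting and ranking of step c can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: all function names are hypothetical, and because Formula 2 is not reproduced in this text, the weight is assumed to be the share of the target user's behavior records that fall in each interest point, λ_i = len(s_i) / Σ_j len(s_j). Formula 3 and Formula 4 are used as stated above.

```python
import math


def interest_weights(record_counts):
    """Assumed form of Formula 2: each interest point's share of the
    target user's behavior records."""
    total = sum(record_counts)
    return [count / total for count in record_counts]


def memory_degree(t_months):
    """Formula 3: h(t) = e^(-t), with t measured in months."""
    return math.exp(-t_months)


def rank_interest_points(pred_scores, record_counts, months_since_last):
    """Formula 4: sort interest points by p_i * lambda_i * h(t_i),
    highest value first. Returns (interest point index, value) pairs."""
    weights = interest_weights(record_counts)
    scored = [
        (i, p * w * memory_degree(t))
        for i, (p, w, t) in enumerate(
            zip(pred_scores, weights, months_since_last)
        )
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Three hypothetical interest points of one target user.
ranking = rank_interest_points(
    pred_scores=[4.5, 4.0, 3.0],        # predicted scores p_i
    record_counts=[10, 30, 60],         # len(s_i)
    months_since_last=[0.0, 1.0, 3.0],  # t_i
)
```

In this example the first interest point ranks highest despite having the fewest records, because its last behavior record coincides with the recommendation time and so its memory degree is 1.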

By fragmenting user behavior data, the method organizes disordered behavior data into fragments by transaction session, and handles the extraction of topic words from behavior data and the merging of similar behavior data. It also performs cluster analysis on the user behavior data fragments, classifying the fragments of all users by the interest topics they contain, mining the interest point sets that correspond to similar user behavior, and building the public user interest model, which improves the precision with which users' dynamic interest models are described. For a target user, the method analyzes the user's behavior data, obtains the user's interest point set from the public user dynamic interest model, and applies a collaborative filtering algorithm to make a preliminary recommendation for each of the target user's interest points. It then weights the predicted score of the preliminary recommendation result, the interest weight, and the memory degree of each interest point, sorts the resulting values, and provides the recommendation results of the highest-ranked interest points to the target user, thereby addressing the difficulty of personalized recommendation when user interest is time-varying, multi-dimensional, and discrete.

Like conventional personalized recommendation methods for dynamic user interest, the present invention mines users' dynamic interest points by analyzing user behavior data and builds a dynamic user interest model. It likewise uses a collaborative filtering algorithm to make recommendations for user interests, producing personalized recommendations based on the user's dynamic interests. It differs from existing personalized recommendation methods in that those methods consider only the dynamic time variation of user interest, whereas the present invention also mines multi-dimensional, discrete user interest points from the behavior data and so describes the user's dynamic interest model more accurately. For the multi-dimensional, discrete interest topics of the target user, the present invention also makes the preliminary recommendations for the user's interest points concurrently, and finally weights the weight, memory degree, and predicted preliminary recommendation score of each interest point and recommends from the interest points with the highest computed values, improving the precision and processing performance of personalized recommendation.

The beneficial effects of the present invention are as follows. By fragmenting and clustering the behavior data generated when users access the e-commerce platform, the method handles the difficulties posed by the time variation, multi-dimensionality, and discreteness of the interest topics contained in user behavior data and describes users' dynamic interest models accurately, providing the basis for precise personalized recommendation. For the multi-dimensional interest points extracted by cluster analysis, the method makes a preliminary recommendation for each, and then combines the current weight and memory degree of each of the target user's interest points with the predicted score of the preliminary recommendation result to produce a combined recommendation, so that the recommendation results are more accurate. In addition, compared with existing personalized recommendation methods for dynamic interest, the present invention fragments and merges user behavior data, which reduces the cost of the subsequent cluster analysis. Likewise, executing personalized recommendation in parallel for the user's multi-dimensional interest points extracted by cluster analysis improves the processing performance of the recommendation system.

Brief description of the drawings

Fig. 1 is a system structure diagram of the model of the method of the invention;

Fig. 2 is an overall flow chart of the processing performed by the model;

Fig. 3 is a flow chart of user behavior data fragmentation;

Fig. 4 is a flow chart of user behavior data cluster analysis;

Fig. 5 is a flow chart of recommendation for the target user;

Fig. 6 is a schematic diagram of the user behavior data collection process;

Fig. 7 is a schematic diagram of the principle of user behavior data fragmentation;

Fig. 8 is a schematic diagram of the cluster analysis process for user behavior data fragments.

Specific embodiment

The present invention is described in detail below with reference to the accompanying drawings and embodiments.

To improve the precision of a personalized recommendation system, the time variation, multi-dimensionality, and discreteness of users' dynamic interests must all be considered. To collect the large volume of user behavior data effectively and make it easy to analyze, the present invention takes each transaction session in which a user accesses the e-commerce platform as the unit and organizes the access operations of that session into one behavior data fragment; the behavior data of every user is fragmented in this way. Because each transaction behavior data fragment of a user reflects some interest or intention, the present invention extracts the topic words of each of the user's behavior data fragments so that the large number of collected fragments can be analyzed. To separate a user's behavior data fragments along different dimensions, the present invention merges the user's fragments by similar topic words and keeps only the user's fragments with distinct topics. After the behavior data fragments of each individual user have been obtained, the fragments of all users are clustered to extract the sets of fragments that share similar interest topics across all users. The user behavior data in each set indicates that the corresponding users share the same interest topic. In this way the multi-dimensional interest topics of all users are mined, and the public user interest model is built from these interest topics for use in personalized recommendation. In addition, by continually analyzing user behavior data, new interest topics are added to the user interest model, so that the model is updated dynamically. When personalized recommendation is performed for a target user, the set of topics the target user is interested in is found first; then, using the purchase data and rating data of the users whose behavior fragments belong to each of those interest topics, a collaborative filtering algorithm performs personalized recommendation separately for each of the target user's interest topics. Finally, the current weight and memory degree of each of the target user's interest points and the predicted score of the preliminary recommendation result are weighted and sorted, and the recommendation results of the interest points with the highest values are provided to the target user. The specific processing steps are as follows:

1. User behavior data fragmentation. The present invention defines one complete transaction session of a user as one behavior data fragment. Fragments are delimited by the creation and destruction of the session, and the user's operation data within that period forms one behavior data fragment.

2. The many behavior data fragments of each user are merged by similar topic. First, the topic word set (one or more words) of each behavior data fragment is extracted, and each topic word is tagged with a part of speech according to the content the user browsed, which resolves the problem of merging polysemous words. Second, the topic words of the behavior data fragments are compared for similarity and highly similar fragments are merged, which reduces the cost of the subsequent cluster analysis. The result is the user's set of behavior data fragments with multi-dimensional topics.

3. Users' potential interest points are mined by cluster analysis and the public user dynamic interest model is built. Because behavior data fragments with similar topic words necessarily reflect similar interest topics, the present invention clusters the feature vectors of all users' behavior data fragments to find the sets of fragments with similar topic words, thereby extracting users' multi-dimensional, discrete interest points, and builds the user interest model from these interest points. First, the tf-idf (topic word weight) and part-of-speech information of the topic words in each behavior data fragment are extracted and a feature vector is generated for each fragment. Second, the similarity between behavior data fragments is calculated from the feature vectors, and a bottom-up hierarchical clustering algorithm iteratively clusters the fragments to obtain sets of fragments with similar interests. Then, the topic words with the higher tf-idf values among all fragments in each set are extracted, giving the interest topics of all users, from which the public user interest model is built.

4. A personalized recommendation algorithm is executed for each interest point of the target user. First, the target user's behavior data fragments are analyzed and all of the user's interest points are found in the public user interest model. Then the collaborative filtering algorithm is executed in parallel for the interest points, producing a preliminary personalized recommendation result for each of the user's interest points.

5. The current weight, memory degree, and score of each of the target user's interest points are weighted and sorted, and the recommendation results of the interest points with the highest computed values are provided to the target user. First, the weight, the memory degree, and the predicted score of the recommendation result are calculated for each of the target user's interest points and combined by weighting. Then the weighted values of the interest points are sorted, and the recommendation results of the interest point with the highest weighted value are taken as the final recommendation result.

As shown in Fig. 1, the model of the method involves four parts: the e-commerce platform, the behavior data collection module, the user interest mining module, and the system recommendation module. The e-commerce platform is the application basis of the recommendation system. Besides providing e-commerce services to customers, it records in the platform database and log system the behavior data of users searching for, browsing, purchasing, and evaluating goods. The behavior data collection module collects the relevant user behavior data and user rating data from the user database, the log system, and the goods database. The user interest mining module fragments the user behavior data, extracts the feature vector of each behavior data fragment, and performs cluster analysis on these vectors to mine users' multi-dimensional, discrete interest points, thereby building the public user interest model of the e-commerce platform. The recommendation module analyzes the target user's behavior data, extracts the target user's interest point set from the public user interest model, and uses a collaborative filtering method to provide the e-commerce platform with personalized recommendations for the target user.

In the model of the method, the user interest mining module consists mainly of processing units for user behavior data fragmentation, behavior data fragment feature vector extraction, behavior data fragment similarity calculation, and behavior data fragment cluster analysis. The user behavior data fragmentation unit fragments behavior data by the user's transaction sessions and merges the user's data fragments by similar topic words, yielding a group of behavior data fragments with distinct topics. The feature vector extraction unit extracts the tf-idf value of the topic words in each behavior data fragment and arranges the topic words and their tf-idf values in the order of the Chinese vocabulary table to generate the fragment's feature vector. The feature vector represents the features of a user behavior data fragment and is used to calculate the similarity between fragments. The similarity calculation unit performs two kinds of calculation. The first computes the similarity among all behavior data fragments of a single user, so that fragments with similar topic words can be merged. The second computes the similarity among the behavior data fragment feature vectors of all users of the platform, providing the similarity measure between data fragments for the cluster analysis unit. The cluster analysis unit clusters the behavior data fragments of all users and produces a group of data fragment sets with distinct topics. The data fragments in each set share similar interest points, and from these sets the public user interest model of the e-commerce platform is built. The system recommendation module analyzes the target user's behavior data and finds the target user's interest point set in the public user interest model. It then executes the collaborative filtering recommendation algorithm for each of the target user's interest points to generate preliminary recommendation results. Finally, it weights the weight and memory degree of each of the target user's interest points with the predicted score of the preliminary recommendation result, and selects the recommendation results of the interest point with the highest computed value as the final personalized recommendation result.
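The feature vector extraction and similarity calculation described above can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the method's actual implementation: the function names are hypothetical, a sorted term order stands in for the Chinese vocabulary table order, the tf-idf variant (term frequency times log of inverse document frequency) is one common choice, and cosine similarity is assumed because the text does not name the measure used on feature vectors. Each topic word is represented as a (word, part of speech) pair.

```python
import math
from collections import Counter


def tfidf_vectors(fragments):
    """Build one tf-idf feature vector per behavior data fragment.

    fragments: list of lists of (word, part_of_speech) topic terms.
    Returns the shared vocabulary and the list of vectors.
    """
    n = len(fragments)
    # Number of fragments in which each term occurs.
    doc_freq = Counter(term for frag in fragments for term in set(frag))
    vocab = sorted(doc_freq)  # fixed term order shared by all vectors
    vectors = []
    for frag in fragments:
        counts = Counter(frag)
        total = len(frag) or 1
        vectors.append([
            (counts[term] / total) * math.log(n / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Three hypothetical fragments: two about travel, one about entertainment.
fragments = [
    [("hotel", "n"), ("flight", "n")],
    [("hotel", "n"), ("train", "n")],
    [("movie", "n"), ("music", "n")],
]
vocab, vectors = tfidf_vectors(fragments)
travel_similarity = cosine(vectors[0], vectors[1])
cross_similarity = cosine(vectors[0], vectors[2])
```

The two travel fragments share the term "hotel" and so have a positive similarity, while the travel and entertainment fragments share no term and have a similarity of zero.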

As shown in Fig. 2, the personalized recommendation process of the present invention is divided into a sub-process that builds the public user interest model and a sub-process that makes recommendations to the target user. The public user interest model is built in four steps. First, while users access the e-commerce platform, the platform system automatically records each user's operation data. Then the user behavior collection module collects each user's behavior data from the operation database and the log database and organizes it into a behavior data list by user. Next, the model fragments each user's behavior data by transaction session to obtain user behavior data fragments containing topic words, and merges the fragments of a single user that have the same topic words. Finally, the behavior data fragments of all users are clustered to obtain sets of user behavior data fragments with distinct topic words, each of which reflects a similar interest topic, and the public user interest model is built from them.
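The final clustering step can be sketched as a bottom-up (agglomerative) procedure that follows steps b1 and b2 of the technical solution. This is an illustrative sketch, not the patented implementation: the function names are hypothetical, and because Formula 1 is not reproduced in this text, the Jaccard coefficient of the topic word sets is assumed as the similarity. Topic words are (word, part of speech) pairs, so two words match only when both the word and its part of speech are the same.

```python
def jaccard(a, b):
    """Assumed stand-in for Formula 1: |a & b| / |a | b| on topic word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_fragments(topic_sets, target_count):
    """Merge the two most similar categories until target_count remain.

    topic_sets: list of sets of (word, part_of_speech) pairs, one per fragment.
    Returns a list of clusters, each a sorted list of fragment indices.
    """
    clusters = {i: [i] for i in range(len(topic_sets))}
    # b1: every fragment starts as its own category; compute all similarities.
    sim = {
        frozenset((i, j)): jaccard(topic_sets[i], topic_sets[j])
        for i in clusters for j in clusters if i < j
    }
    next_id = len(topic_sets)
    while len(clusters) > max(target_count, 1):
        # b2: merge the pair of categories with the highest similarity.
        pair = max(sim, key=sim.get)
        a, b = tuple(pair)
        merged = clusters.pop(a) + clusters.pop(b)
        del sim[pair]
        # The new category's similarity to each remaining category is the
        # average of the two merged categories' similarities to it.
        for other in clusters:
            s_a = sim.pop(frozenset((a, other)))
            s_b = sim.pop(frozenset((b, other)))
            sim[frozenset((next_id, other))] = (s_a + s_b) / 2
        clusters[next_id] = merged
        next_id += 1
    return [sorted(members) for members in clusters.values()]


# Four hypothetical fragments: two about travel, two about entertainment.
topic_sets = [
    {("hotel", "n"), ("flight", "n")},
    {("hotel", "n"), ("train", "n")},
    {("movie", "n"), ("music", "n")},
    {("movie", "n"), ("concert", "n")},
]
clusters = cluster_fragments(topic_sets, target_count=2)
```

With a target of two categories, the travel fragments and the entertainment fragments end up in separate clusters; the topic words of each cluster would then serve as one interest topic of the public user interest model (step b3).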

Personalized recommendation for the target user is completed in three steps. First, the target user's behavior data fragments are analyzed and all of the target user's interest points are found in the public user interest model. Then the collaborative filtering algorithm generates a preliminary recommendation result for each of the user's interest points. Finally, the weight and memory degree of each interest point and the predicted score of the preliminary recommendation result are weighted, and the recommendation results of the interest point with the highest computed value are provided to the target user, completing the personalized recommendation process.

The processing methods of the key modules of the present invention are described below.

1. User behavior data collection

User behavior data is the data basis for personalized recommendation. The method needs to collect not only users' search and browsing behavior data but also their purchase and rating behavior data. Search and browsing behavior data is mainly used to mine users' dynamic, multi-dimensional, and discrete interest points, while purchase and rating behavior data is used for goods recommendation in collaborative filtering. User behavior data is collected mainly from the user database, the goods database, and the log system, and covers five behavior types: login behavior data, search behavior data, browsing behavior data, purchase behavior data, and rating behavior data. Each collected behavior record must contain not only information such as session id, user id, goods id, behavior type, and behavior content, but also attribute information such as timestamp, browsing terminal, and location. The records are sorted by session id to generate a user behavior data list, which facilitates the fragmentation and cluster analysis of user behavior data.
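A behavior record with the attributes listed above could be modeled as follows. The class name, field names, and behavior type strings are hypothetical and chosen only for illustration. The text specifies sorting by session id; ordering by timestamp within a session is an added assumption so that the operations of a session stay in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class BehaviorRecord:
    """One collected user behavior record (hypothetical field names)."""
    session_id: str
    user_id: str
    behavior_type: str  # e.g. "login", "search", "browse", "purchase", "rating"
    content: str        # behavior content, such as a search keyword
    timestamp: datetime
    item_id: Optional[str] = None   # goods id, when the behavior targets an item
    terminal: Optional[str] = None  # browsing terminal
    location: Optional[str] = None


def build_behavior_list(records):
    """Sort records by session id, then by timestamp within each session."""
    return sorted(records, key=lambda r: (r.session_id, r.timestamp))


# Hypothetical records from two sessions, arriving out of order.
records = [
    BehaviorRecord("s2", "u1", "browse", "sci-fi movie",
                   datetime(2016, 9, 2, 20, 5), item_id="g42"),
    BehaviorRecord("s1", "u1", "search", "hotel",
                   datetime(2016, 9, 1, 9, 30)),
    BehaviorRecord("s1", "u1", "login", "",
                   datetime(2016, 9, 1, 9, 0)),
]
behavior_list = build_behavior_list(records)
```

After sorting, the two records of session s1 come first in chronological order, followed by the record of session s2, which is the grouping the fragmentation step relies on.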

2. User behavior data fragmentation

On an e-commerce platform, each user session has a fairly clear purpose, so the user's operations within that session carry a certain interest topic. The present invention therefore fragments user behavior data in units of the user's transaction sessions. To support efficient cluster analysis of user behavior data fragments, this unit also extracts topic words from the content of each fragment and then merges the behavior data fragments that have similar topic words. The process is shown in Fig. 3.

As shown in Fig. 3, the user behavior data fragmentation process consists of the following main steps:

1) Read the behavior data of a single user from the database of the acquisition module, including search, browsing, purchase, rating, and login behavior data.

2) Define each complete transaction session of the user as one behavior data fragment. Specifically, the sequence of operations performed between the creation and the destruction of each user session forms one behavior data fragment.

3) From the content of each of the user's data fragments, extract the topic words of the user's search and browsing information to generate the topic word set of the behavior data fragment. Each topic word is assigned a part of speech according to the browsed content, which resolves the polysemy problem of natural language.

4) Merge the behavior data fragments of a single user that have similar topic word sets. The similarity of the topic word sets of the behavior data fragments is compared, and highly similar fragments are merged, yielding a group of behavior data fragments with different topic words for this user.
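Step 2 above can be sketched as a grouping of records by session. This is a minimal sketch under stated assumptions: records are plain dictionaries with the hypothetical keys `user_id`, `session_id`, and `timestamp`, and a session id is taken to delimit exactly the period between session creation and destruction.

```python
from collections import defaultdict


def shard_by_session(records):
    """Group behavior records into one fragment per (user, session).

    records: iterable of dicts with 'user_id', 'session_id', 'timestamp'
    (key names are assumptions made for this sketch).
    """
    fragments = defaultdict(list)
    for record in records:
        fragments[(record["user_id"], record["session_id"])].append(record)
    # Keep the operations of each fragment in chronological order.
    for operations in fragments.values():
        operations.sort(key=lambda r: r["timestamp"])
    return dict(fragments)
```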

3. User behavior cluster analysis

User behavior cluster analysis analyzes the behavior data fragments of all users and mines public user interest topics from them. It comprises two processing units: feature vector extraction for user behavior data fragments, and cluster analysis of user behavior data fragments. Because behavior data fragments with similar topic words necessarily contain similar interest topics, this module computes the similarity of the topic words of the data fragments and applies cluster analysis to find sets of behavior data fragments with similar topic words, thereby extracting the users' multi-dimensional, discrete interest points and building the user interest model from them. The process is shown in Fig. 4:

1) Extract the feature vectors of the user behavior data fragments. From the topic word set of each user behavior data fragment obtained by the fragmentation process, calculate the tf-idf value of each topic word; this value measures the importance of the topic word. Arranging the topic words and their tf-idf values in the order of the Chinese vocabulary table yields the feature vector of the behavior data fragment.

2) Calculate the similarity between behavior data fragments. A vector space model is built with each topic word of the feature vector as one dimension. When two feature vectors are orthogonal, the similarity of the behavior data fragments is 0; when the feature vectors coincide, the similarity is 100%. The similarity between behavior data fragments can therefore be calculated with the cosine formula, and the resulting cosine value is the similarity between the fragments.

3) Run the hierarchical clustering algorithm to cluster the data fragments. The present invention uses a bottom-up hierarchical clustering algorithm that repeatedly merges the two most similar classes of behavior data fragments until all user behavior data fragments have been clustered. First, each behavior data fragment is regarded as one class. Then, the two classes with the highest similarity are merged into one class. This is iterated until the specified number of classes is reached.

4) Select the optimal clustering level and determine the clustering result. The bottom-up hierarchical clustering algorithm ultimately produces a tree-shaped clustering result. By finding the level at which the similarity between classes changes the most, the optimal clustering result can be determined, giving sets of user behavior data fragments with multi-dimensional, discrete interest topics.

5) Build the public user interest model from the above clustering result. From each set of behavior data fragments in the clustering result, extract the topic word with the highest tf-idf value among the fragments it contains to obtain a user interest topic (interest point). Organizing these multi-dimensional, discrete user interest topics together with their sets of behavior data fragments constitutes the public user interest model.

4. Personalized recommendation for the target user

When users access an e-commerce platform, their interests are dynamically time-varying, multi-dimensional, and discrete; that is, a user's interests in different areas can be far apart. Personalized recommendation precision can therefore be effectively improved only by recommending separately for each of the user's interest points. First, personalized recommendation is executed separately for each interest point of the target user to generate preliminary recommendation results. Then, the weight and memory degree of each of the user's interest points and the predicted score of the preliminary recommendation results are combined in a weighted calculation, and the recommendation results of the interest point with the highest calculated value are provided to the target user. The processing flow is shown in Fig. 5.

Fig. 5 is the personalized recommendation flow chart for the target user. Its steps are as follows:

1) Find the interest points of the target user. Based on the public user interest model, analyze the target user's behavior data fragments and find the user's interest point set.

2) For each interest point of the target user, execute the collaborative filtering recommendation algorithm separately to generate preliminary recommendation results for that interest point.

3) Calculate the ranking of the target user's interest points. The weight and memory degree of each interest point and the predicted score of its preliminary recommendation results are combined in a weighted calculation, and the interest points are sorted by the result. The weighted values and their order reflect how interested the target user currently is in each interest point.

4) Generate the personalized recommendation result. From the list of weighted values of the interest points, choose the recommendation results of the interest point with the highest value as the final recommendation results.

Embodiment:

1. User behavior data collection

Unlike traditional personalized recommendation systems, which collect only user purchase and rating data, this example also collects users' search and browsing behavior data. Each collected behavior record must contain not only the session id, user id, product id, behavior type, and behavior content, but also attribute information such as the timestamp, browsing terminal, and location. These basic attributes support the fragmentation of user behavior data in the next step. The collection process is shown in Fig. 6.

As shown in Fig. 6, the user behavior data acquisition module first collects behavior data of the login, search, browsing, purchase, and rating categories from the user database, the product database, and the log system of the e-commerce platform. Every behavior record contains basic attribute information (such as session id, user id, behavior type, behavior content, user IP, login device, and time). To ensure the integrity of user behavior data collection, the e-commerce platform creates a session when the user starts accessing the system and destroys the session information after the user exits. After the user logs in to the e-commerce platform, the session id is associated with the user's behavior data list. Login behavior data contains the user's session information and can be used for fragmenting user behavior data. User search and browsing behavior data are mainly used to mine the user's dynamic, multi-dimensional, and discrete interest points in order to build the public user interest model. User purchase and rating behavior data are used for personalized recommendation.

2. User behavior data fragmentation

Most of a user's transaction sessions on the e-commerce platform have a purpose of interest; that is, all operations within the transaction share the same interest topic. Dividing user behavior data into fragments in units of the user's transaction sessions therefore makes each behavior data fragment contain one topic. Then, all behavior data fragments of each user are merged according to the similar topic words they contain, which improves the performance of the subsequent user interest mining. The principle of user behavior data fragmentation is shown in Fig. 7.

1) Read the behavior data of each user from the user behavior database and organize the behavior data list by user id. In addition to the basic attribute data, the behavior data read also contains the operation data related to each behavior.

2) In units of a single user transaction session, divide the group of behavior data within the transaction into one behavior data fragment. Specifically, the creation and destruction of the user session are taken as boundaries, all of the user's behavior data within this time period is treated as one behavior data fragment, and invalid data (for example, exiting immediately after logging in) is filtered out to reduce noise in the user behavior data.

3) Extract the topic words of each behavior data fragment. For search behavior data, the topic words of the fragment are the search keywords. For browsing and purchase behavior data, the present invention extracts the potential topic words of the browsed and purchased content as follows. First, a Chinese word segmentation module segments the text of the behavior content and filters out meaningless function words, giving the set of words contained in the behavior content. Then, the tf-idf algorithm is used to calculate the importance of each word. Let t_i be the number of times word i occurs and t the total number of occurrences of all words; the tf-idf value of word i is then given by formula 5.

\mathrm{tfidf}_i = \frac{t_i}{t} \times \log \frac{d}{d_i} (formula 5)

Here, t_i/t is the term frequency (tf) of word i in the detailed description of the browsed content. The inverse document frequency (idf) of word i is log(d/d_i), where d is the total number of products and d_i is the number of product descriptions in which word i appears. The product of tf and idf gives the importance of each word. The several words with the highest importance are selected as the potential topic word set of the browsed and purchased content. In addition, to further improve the precision of the extracted topic words, labels are added to the topic words according to the attributes (category, purpose, etc.) of the content the user browsed or purchased, which resolves the polysemy problem. Let k be a topic word of a user behavior data fragment and s the browsed content information of that topic word; the topic word set of a behavior data fragment with n topic words can then be expressed as (k_1<s_1>, k_2<s_2>, …, k_i<s_i>, …, k_n<s_n>).

4) Merge the behavior data fragments that have similar topic words. The similarity of the topic word sets of the behavior data fragments is calculated, and when the similarity exceeds a certain threshold (for example, 80%) the two behavior data fragments are merged. The similarity between behavior data fragments can be calculated with the cosine similarity for sets. Given two behavior data fragments u_i and u_j, their topic word set similarity s(u_i, u_j) is calculated by formula 6.

s(u_i, u_j) = \frac{|v(u_i) \cap v(u_j)|}{\sqrt{|v(u_i)| \, |v(u_j)|}} (formula 6)

Here, s(u_i, u_j) is the similarity between behavior data fragments u_i and u_j, and v(u_i) and v(u_j) are the topic word sets of u_i and u_j, respectively. When calculating the intersection of the topic word sets, two topic words are considered identical only if the words are the same and have the same part of speech. This method merges behavior data fragments whose topic word sets are highly similar, which reduces the amount of data for subsequent analysis and improves the execution performance of user interest mining.
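The merging rule of step 4 can be sketched as follows. This is an illustration only, with several assumptions: each topic word is modeled as a hypothetical `(word, tag)` tuple so that two entries match only when both the word and its tag agree, the function names are invented for this sketch, and the greedy repeat-until-stable merge order is a design choice the text does not specify.

```python
from math import sqrt


def set_cosine(a, b):
    """Set cosine similarity: |a ∩ b| / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def merge_fragments(topic_sets, threshold=0.8):
    """Greedily merge topic word sets whose similarity exceeds the threshold."""
    merged = [set(s) for s in topic_sets]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if set_cosine(merged[i], merged[j]) > threshold:
                    merged[i] |= merged[j]   # union of the two topic word sets
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

With tagged tuples, `("apple", "fruit")` and `("apple", "phone")` are different set members, which is how the polysemy rule enters the intersection.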

3. Implementation of user behavior data cluster analysis

Because behavior data fragments with similar topic word sets contain similar interest points, the purpose of this module is to analyze the topic word information of all user behavior data fragments with clustering techniques, cluster together the user behavior data fragments that have similar topic word feature vectors to obtain sets of behavior data fragments with similar user interest points, and then extract the users' multi-dimensional, discrete interest point set and build the public user interest model from it.

Before the hierarchical clustering algorithm is applied, the feature vectors of the user behavior data fragments must be calculated. The present invention first takes the topic word sets of the user behavior data fragments, together with their part-of-speech information, obtained during behavior data fragmentation. Second, it calculates the frequency (tf) with which each topic word occurs in each behavior data fragment; the calculation method is given in formula 6-1. Then, it calculates the inverse document frequency (idf) of each topic word as log(d/d_i), where d is the number of behavior data fragments of all users and d_i is the number of times topic word i appears across all behavior data fragments. Multiplying the tf value of each topic word by its idf value gives the tf-idf value of that topic word. Finally, the topic words and their tf-idf values are arranged in the order of a common vocabulary table to obtain the feature vector of each behavior data fragment. This feature vector reflects the interest characteristics of the user behavior data fragment.
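The feature vector construction described above can be sketched as below. This is a minimal sketch under stated assumptions: the function name `tfidf_vectors` is hypothetical, sorted order stands in for the vocabulary table order, and d_i is taken as the number of fragments containing the topic word (the usual document-frequency reading, whereas the text describes it as the number of occurrences across all fragments).

```python
import math
from collections import Counter


def tfidf_vectors(fragments):
    """Build tf-idf feature vectors for fragments.

    fragments: list of lists of topic words (repeats allowed).
    Returns (vocab, vectors) where every vector follows the order of vocab.
    """
    d = len(fragments)
    # Assumption: d_i counts the fragments that contain the word.
    doc_freq = Counter(w for frag in fragments for w in set(frag))
    vocab = sorted(doc_freq)   # stand-in for the vocabulary table order
    vectors = []
    for frag in fragments:
        counts = Counter(frag)
        total = sum(counts.values())
        vectors.append([
            (counts[w] / total) * math.log(d / doc_freq[w]) if total else 0.0
            for w in vocab
        ])
    return vocab, vectors
```

Because all vectors share one vocabulary, topic words missing from a fragment are represented by 0, which matches the zero-padding remark made for the cosine formula.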

After the feature vector of each behavior data fragment has been obtained, the bottom-up hierarchical clustering algorithm is executed to complete the cluster analysis. The cluster analysis process is illustrated in Fig. 8.

First, each behavior data fragment is regarded as one class; in Fig. 8, for example, there are 30 behavior data fragments, and each fragment is a class. Then, the similarity between the fragments is calculated from their feature vectors, and the two classes with the highest similarity are merged into one class. When the two classes contain multiple behavior data fragments, the average similarity between the behavior data fragments of the two classes is used as the similarity of the two classes. This is iterated until the specified number of classes is reached, finally producing the tree-shaped clustering result in Fig. 8.
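The merging loop just described corresponds to average-linkage agglomerative clustering. The following is a small sketch, not an optimized implementation: the function name `agglomerative`, the use of a precomputed pairwise similarity matrix, and the returned `history` list are assumptions made for illustration.

```python
def agglomerative(sim, target_clusters):
    """Bottom-up clustering with average similarity between classes.

    sim: square symmetric matrix, sim[i][j] = similarity of fragments i and j
         (assumed non-negative, as with cosine values of tf-idf vectors).
    Returns (clusters, history): clusters are lists of fragment indices,
    history holds the similarity at which each merge happened.
    """
    clusters = [[i] for i in range(len(sim))]
    history = []

    def average_similarity(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max(target_clusters, 1):
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_similarity(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]   # merge the two closest classes
        del clusters[y]
        history.append(best)
    return clusters, history
```

The pairwise search is cubic in the number of fragments overall, which is acceptable for a sketch; the earlier per-user merging of fragments is what keeps the input small.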

The present invention measures the similarity between the feature vectors of behavior data fragments with the cosine formula. Let (x_1, x_2, …, x_n) and (y_1, y_2, …, y_n) be the feature vectors of behavior data fragments x and y (note: missing topic words can be padded with zeros to resolve inconsistent feature vector lengths); the similarity cos θ of x and y is then given by formula 7.

\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} (formula 7)
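A small sketch of the cosine measure with zero padding follows. Representing each feature vector as a dictionary from topic word to tf-idf value is an assumption of this sketch; a word absent from one dictionary is treated as 0, which has the same effect as padding the shorter vector.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two sparse feature vectors.

    x, y: dicts mapping topic word -> tf-idf value; missing words count as 0.
    """
    dot = sum(value * y.get(word, 0.0) for word, value in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0   # an all-zero vector has no direction; treat as dissimilar
    return dot / (norm_x * norm_y)
```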

In addition, after the bottom-up hierarchical clustering algorithm has produced the tree-shaped clustering result, it is necessary to decide which level to select as the final clustering result. Research shows that after two classes with the same topic are merged, the similarity between the merged class and the other classes does not change much. After two classes with different topics are merged, however, the similarity to the other classes drops significantly. The level immediately before the merge that causes the largest change in similarity between classes is therefore taken as the optimal clustering result.
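One possible reading of this selection rule is sketched below. It is an assumption-laden illustration, not the method fixed by the text: the similarity at which each merge happened is used as a proxy for the similarity between classes, and the cut is placed just before the merge with the largest drop. The function name `best_cut` is hypothetical.

```python
def best_cut(merge_similarities):
    """Return how many merges to keep, cutting before the largest drop.

    merge_similarities: similarity of the merged pair at each successive
    merge of a bottom-up clustering run (assumed proxy for the change in
    similarity between classes).
    """
    if len(merge_similarities) < 2:
        return len(merge_similarities)
    drops = [
        merge_similarities[k - 1] - merge_similarities[k]
        for k in range(1, len(merge_similarities))
    ]
    largest = max(range(len(drops)), key=drops.__getitem__)
    return largest + 1   # merges 0..largest are kept, the next one is rejected
```

For a run with merge similarities 0.9, 0.85, 0.8, 0.2, 0.1, the largest drop is between the third and fourth merges, so the first three merges are kept.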

The above cluster analysis yields a group of sets of user behavior data fragments with similar interests. Extracting the topic word with the highest tf-idf value from each set as its interest topic gives the interest topic set of the public users. Organizing these interest topics together with their sets of behavior data fragments establishes the public user interest model.
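The model construction above can be sketched as a mapping from interest topic to the fragments in its cluster. The data layout (clusters as lists of fragment ids, tf-idf values in a dictionary per fragment) and the name `build_interest_model` are assumptions for illustration; if two clusters happen to share the same top topic word, this sketch simply pools their fragments.

```python
def build_interest_model(clusters, fragment_tfidf):
    """Map each cluster's top topic word to the fragment ids of that cluster.

    clusters: list of lists of fragment ids.
    fragment_tfidf: fragment id -> {topic word: tf-idf value}.
    """
    model = {}
    for members in clusters:
        best_word, best_value = None, float("-inf")
        for fragment_id in members:
            for word, value in fragment_tfidf[fragment_id].items():
                if value > best_value:
                    best_word, best_value = word, value
        if best_word is not None:
            model.setdefault(best_word, []).extend(members)
    return model
```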

4. Implementation of personalized recommendation for the target user

When users access an e-commerce platform, their interests are time-varying, multi-dimensional, and discrete. For this situation, a strategy of recommending separately for each of the target user's interest points is adopted; its recommendation precision is higher than that of personalized recommendation based on all of the user's behavior data taken together. However, because the user likes each interest point to a different degree, the weight of each interest point's personalized recommendation results also differs. For example, suppose a user interest point s_a contains 1000 user behaviors while another interest point s_b contains only 20. Even if the predicted score of recommendation result a in s_a is slightly lower than that of recommendation result b in s_b, the user is likely to prefer a far more than product b. Conversely, if the most recent behavior record of interest point s_a is much older than that of interest point s_b, the user may equally well prefer product b to product a, because the user's interests may already have changed. How to calculate the weight of each of the user's interest points, weight the recommendation results of each interest point proportionally, and arrange them into the final personalized recommendation list is therefore the key to obtaining accurate personalized recommendation results.

From the user interest model, the present invention finds the potential interest points of the target user and extracts all user rating data related to these interest topics. It then runs user-based collaborative filtering to obtain the preliminary recommendation results of each interest point and their predicted scores p_i. Because a user who is interested in a certain interest point performs more operations related to it, the present invention calculates the weight (λ_i) of each of the target user's interest points from the number of user behavior records in that interest point. Let s_i be the i-th interest point of the target user and len(s_i) the number of behavior records contained in s_i; the weight (λ_i) of interest point s_i is then calculated as shown in formula 8.

\lambda_i = \frac{len(s_i)}{len(s_1) + len(s_2) + ... + len(s_i) + ... + len(s_n)} (formula 8)
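The weight calculation can be illustrated with a short sketch. The function name `interest_weights` and the dictionary input are assumptions; the computation itself is the normalization of record counts described above.

```python
def interest_weights(record_counts):
    """Weight of each interest point: its record count over the total count.

    record_counts: interest point name -> number of behavior records len(s_i).
    """
    total = sum(record_counts.values())
    if total == 0:
        return {name: 0.0 for name in record_counts}
    return {name: count / total for name, count in record_counts.items()}
```

With the counts from the earlier example, `interest_weights({"s_a": 1000, "s_b": 20})` gives s_a a weight of 1000/1020 and s_b a weight of 20/1020, so the weights always sum to 1.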

The above method gives the interest point weights of the target user. However, user interests are also dynamically time-varying: the older an interest point is, the lower the user's interest in it. With reference to the forgetting curve obtained by the German psychologist Ebbinghaus, it is found that user interest points likewise follow the forgetting curve. The present invention therefore applies a forgetting function h(t) to the interest point weight λ_i according to the forgetting law of user interest points. Let t be the time interval from the last behavior record in a user interest point to the time of recommendation; the memory degree of the user interest point is then calculated as shown in formula 9.

h(t) = e^{-t} (formula 9)

Here, t is measured in months. When the last behavior record occurs at the same time as the recommendation, the interval is 0 and h(0) = 1, meaning that the user has not yet begun to forget this interest. Finally, the preliminary recommendation results of each of the user's interest points are weighted and sorted to obtain the interest point ranking list p. Assuming the target user has n interest points, the ranking list p can be calculated as shown in formula 10.

p = sort(p_1*λ_1*h(t_1), p_2*λ_2*h(t_2), p_3*λ_3*h(t_3), …, p_i*λ_i*h(t_i), …, p_n*λ_n*h(t_n)) (formula 10)

In formula 10, the sort() function combines the predicted score of the preliminary recommendation results, the interest weight, and the memory degree of each of the target user's interest points in a weighted calculation and sorts the resulting values. p_i is the predicted score of the preliminary recommendation results of the user's i-th interest point, t_i is the interval from this interest point to the time of recommendation, and h(t_i) is the user's memory degree for interest point i. Finally, according to the calculated values in the interest point ranking list, the recommendation results of the interest point with the highest value are provided to the target user, realizing personalized recommendation that takes into account the time-varying, multi-dimensional, and discrete character of the user's dynamic interests.
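The ranking step can be sketched as follows. This is an illustration under stated assumptions: the function name `rank_interest_points` and the tuple layout are hypothetical, and the predicted scores and weights are taken as already computed by the collaborative filtering and weighting steps.

```python
import math


def rank_interest_points(points):
    """Rank interest points by p_i * lambda_i * h(t_i), with h(t) = exp(-t).

    points: list of (name, predicted_score, weight, months_since_last_record).
    Returns (name, value) pairs sorted from highest to lowest value.
    """
    scored = [
        (name, score * weight * math.exp(-months))
        for name, score, weight, months in points
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For instance, an interest point with a large weight (0.98) but last visited six months ago ranks below a lightly weighted one (0.02) visited at recommendation time, because e^{-6} is roughly 0.0025. The recommendation results of the first entry in the returned list are the ones provided to the target user.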

Claims

1. A modeling recommendation method based on user behavior data fragmentation clustering, characterized in that it comprises:

a. User behavior data customization processing, specifically comprising:

a1. User behavior data collection; the user behavior data refers to the user behavior data collected by the e-commerce platform when a user accesses the e-commerce platform over the internet, and includes at least login, search, browsing, purchase, and evaluation data; each kind of user behavior data also includes basic attribute information assigned by the e-commerce platform, the basic attribute information including at least session id, user id, behavior type, behavior content, user IP, login device, and time;

a2. User behavior data fragmentation; specifically, the behavior data collected in step a1 is organized by user, and then, taking each transaction session of the user with the e-commerce platform as a unit, the user behavior data is divided by transaction session so that each resulting behavior data fragment contains only one transaction topic, and the user's behavior data fragments containing similar topic words are merged; the transaction session refers to the time period that is created when the user logs in to the e-commerce platform and destroyed after the user ends the access;

b1. After the behavior data of different users has been fragmented in step a, taking each behavior data fragment as one category and calculating the similarity between all categories; specifically: assuming two behavior data fragments u_i and u_j, their topic word set similarity s(u_i, u_j) is calculated by the following formula:

s(u_i, u_j) = \frac{|v(u_i) \cap v(u_j)|}{\sqrt{|v(u_i)| \, |v(u_j)|}}

wherein s(u_i, u_j) is the similarity between behavior data fragments u_i and u_j, and v(u_i) and v(u_j) are the topic word sets of behavior data fragments u_i and u_j, respectively; when calculating the intersection of the topic word sets, two topic words are considered identical only if the words are the same and have the same part of speech;

b2. Merging the two categories with the highest similarity obtained into one category, using the average similarity of the two categories as the similarity of the new category, and repeating step b2 until the specified number of categories is obtained;

b3. Extracting a topic word from each category finally obtained in step b2 as an interest topic, and building the public user interest model;

The e-commerce platform analyzes the behavior data of the target user, finds each interest point of the target user in the public user interest model obtained in step b, and performs preliminary recommendation for each interest point separately using collaborative filtering; the final recommendation is then made by combining the weight and memory degree of each of the target user's interest points with the predicted score of the preliminary recommendation results; assuming that the i-th interest point accounts for a weight λ_i of the target user's interest, the weight is calculated as follows: let s_i be the i-th interest point of the target user and len(s_i) the number of the target user's behavior records contained in interest point s_i; the weight λ_i that interest point s_i accounts for in the target user's interest is then calculated by the following formula:

\lambda_i = \frac{len(s_i)}{len(s_1) + len(s_2) + ... + len(s_i) + ... + len(s_n)}

According to the forgetting law of user interest points, a forgetting function h(t) is applied to the interest point weight λ_i; assuming that t is the time interval from the last behavior record in a user interest point to the time of recommendation, the memory degree of the user interest point is calculated by the following formula:

h(t) = e^{-t}

wherein t is measured in months; when the last behavior record occurs at the same time as the recommendation, the interval is 0 and h(0) = 1, meaning that the user has not yet begun to forget this interest; finally, the preliminary recommendation results of each of the user's interest points are weighted and sorted to obtain the interest point ranking list p; assuming that the target user has n interest points and the predicted score of the recommendation results of the i-th interest point is p_i, the interest point ranking list p can be expressed as follows:

p = sort(p_1*λ_1*h(t_1), p_2*λ_2*h(t_2), p_3*λ_3*h(t_3), …, p_i*λ_i*h(t_i), …, p_n*λ_n*h(t_n))

wherein the sort() function combines the predicted score of the preliminary recommendation results, the interest weight, and the memory degree of each of the target user's interest points in a weighted calculation and sorts the resulting values; p_i is the predicted score of the preliminary recommendation results of the user's interest point i, t_i is the interval from this interest point to the time of recommendation, and h(t_i) is the user's memory degree for interest point i; finally, according to the calculated values in the interest point ranking list, the recommendation results of the interest point with the highest value are provided to the target user, thereby realizing personalized recommendation that takes into account the time-varying, multi-dimensional, and discrete character of the user's dynamic interests.