CN110020189A - An article recommendation method based on Chinese similarity computation - Google Patents
An article recommendation method based on Chinese similarity computation
- Publication number: CN110020189A
- Application number: CN201810701560.8A
- Authority
- CN
- China
- Prior art keywords
- article
- word
- matrix
- vector
- webpage
- Prior art date
- Legal status (an assumption based on the published record, not a legal conclusion)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses an article recommendation method based on Chinese similarity computation, comprising the following steps: crawling the main content of articles with a Python crawler; obtaining word vectors from the crawled content and training them; converting the articles to be recommended into word-vector matrices; converting the user's keyword phrase into a matrix, reading the word-vector matrices obtained in the previous step, standardizing the matrix data, performing the matrix computation, and ranking the articles by similarity coefficient. The method helps Internet users efficiently discover articles of interest, has a wide scope of application, a low manual-labeling cost, and good recommendation diversity.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to an article recommendation method based on Chinese similarity computation.
Background art
With the continuous development of the Internet, people's living habits and lifestyles are undergoing revolutionary change. The Internet has not only made life more convenient but has also greatly widened the channels through which people obtain information. According to the 36th Statistical Report on Internet Development in China released by the China Internet Network Information Center (CNNIC), by June 2015 China had 555 million Internet news users, of whom 460 million read news on mobile phones; as an important information-acquisition application, Internet news ranked second in usage, behind only instant messaging.
Against the background of big data, search engines such as Google and Baidu let users find the information they need by entering keywords. But if a user cannot accurately describe that need as keywords, a search engine is of little help. Unlike a search engine, a recommender system analyzes a user's behavior or the features of items in order to discover content the user is interested in. With the development and growth of major news-publishing platforms (such as WeChat official accounts), the number of articles grows rapidly and finding articles of interest becomes ever harder: the flood of articles brings users a wealth of information but also great difficulty of choice. Helping users efficiently discover articles of interest has become an urgent problem for information-publishing platforms.
Because sufficient information about user interests is lacking, and because processing articles is itself challenging, automatic article recommendation on the Internet has had limited effect, and similar-article recommendation algorithms still have much room for improvement. Such algorithms must use natural language processing to cope with the semantic ambiguity, syntactic ambiguity, non-standard grammar, and inconsistent wording of natural language, convert natural language into mathematical symbols a machine can recognize, and then model and verify by means of machine learning and data mining. A large body of research on similar-article recommendation already exists, such as recommendation based on clustering and classification, recommendation based on keywords, and recommendation of popular articles in specific domains. Although this work achieves some effect in certain scenarios, problems such as high complexity, narrow applicability, high manual-labeling cost, and poor recommendation diversity have limited its application.
Therefore, how to provide an article recommendation method based on Chinese similarity computation that helps Internet users efficiently discover articles of interest, with wide applicability, low manual-labeling cost, and good recommendation diversity, is a problem urgently awaiting solution by those skilled in the art.
Summary of the invention
In view of this, the present invention provides an article recommendation method based on Chinese similarity computation that helps Internet users efficiently discover articles of interest, with wide applicability, low manual-labeling cost, and good recommendation diversity.
To achieve the above goals, the invention provides the following technical scheme:
An article recommendation method based on Chinese similarity computation, comprising the following steps:
Step 1: crawl the main content of articles with a Python crawler;
Step 2: obtain word vectors from the crawled content and train them;
Step 3: convert the articles to be recommended into word-vector matrices;
Step 4: convert the user's keyword phrase into a matrix, read the word-vector matrices obtained in Step 3, standardize the matrix data, perform the matrix computation, and rank the articles by similarity coefficient.
Through the above technical scheme, the technical effect of the invention is: according to the user's points of interest, the most relevant articles are recommended. The core of the algorithm is Chinese similarity computation; it helps Internet users efficiently discover articles of interest, has wide applicability, low manual-labeling cost, and good recommendation diversity.
Preferably, in the above article recommendation method based on Chinese similarity computation, the main content crawled in Step 1 comprises: the body text, the head image, and the article abstract. The body text is used to generate the word-vector representation of the article; the head image is shown when the article is recommended to the user; the article abstract is obtained by extracting three sentences from the original text with the TextRank algorithm, summarizing the article's main content.
Further, the Python modules mainly used for crawling article content are requests, BeautifulSoup, and TextRank4Sentence.
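The extraction step above can be sketched without the third-party libraries the text names (requests, BeautifulSoup, TextRank4Sentence), using only Python's standard html.parser; this is a minimal illustration of pulling the body text and head image out of a fetched page under those simplifying assumptions, not the patent's implementation:

```python
from html.parser import HTMLParser

class ArticleTextParser(HTMLParser):
    """Collects text inside <p> tags and the src of the first <img> (head image)."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []
        self.head_image = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
        elif tag == "img" and self.head_image is None:
            self.head_image = dict(attrs).get("src")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.paragraphs.append(data.strip())

def crawl_article(html):
    """Return the body text and head image of one article page."""
    parser = ArticleTextParser()
    parser.feed(html)
    return {"text": " ".join(parser.paragraphs), "head_image": parser.head_image}
```

In a real crawler the HTML string would come from a `requests.get(...)` call, and the abstract would be produced by the TextRank step described below.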
Preferably, in the above article recommendation method based on Chinese similarity computation, the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. In PageRank, the entire World Wide Web is a directed graph whose nodes are web pages; if page A contains a link to page B, there is a directed edge from A to B. After the graph is constructed, the following formula is used (reconstructed here in the standard PageRank form, consistent with the symbol definitions below):

S(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|

where S(V_i) is the importance (PR value) of page i; d is a damping coefficient, usually set to 0.85; In(V_i) is the set of pages that link to page i; Out(V_j) is the set of pages that the links in page j point to; and |Out(V_j)| is the number of elements in that set. The importance of a page thus depends on the sum of the importances contributed by the pages linking to it.
Through the above technical scheme, the beneficial effect is: each page V_j linking to page i distributes its own importance S(V_j) over all the pages it links to, which is why S(V_j) is divided by |Out(V_j)|. Meanwhile, a page's importance is not determined solely by the pages linking to it; with a certain probability it receives a baseline value, which is the role of d. PageRank must iterate the above formula many times before the result converges; initially the importance of every page is set to 1. The left side of the formula gives the PR value of page i after an iteration, while the PR values used on the right side are those from before the iteration.
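The iteration just described can be sketched as follows; the example graph, damping value, and iteration count are illustrative assumptions, not values from the patent:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to.
    Iterates S(V_i) = (1 - d) + d * sum_{j in In(V_i)} S(V_j) / |Out(V_j)|,
    starting from an importance of 1 for every page, as the text describes."""
    nodes = set(links)
    for outs in links.values():
        nodes.update(outs)
    score = {n: 1.0 for n in nodes}
    # incoming[i] = In(V_i), the pages that link to i
    incoming = {n: [] for n in nodes}
    for j, outs in links.items():
        for i in outs:
            incoming[i].append(j)
    for _ in range(iterations):
        new = {}
        for i in nodes:
            new[i] = (1 - d) + d * sum(score[j] / len(links[j]) for j in incoming[i])
        score = new  # right side always uses the pre-iteration values
    return score
```

On a small graph, pages with more (or more important) in-links end up with higher PR values, matching the intuition in the text.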
Preferably, in the above article recommendation method based on Chinese similarity computation, the TextRank algorithm splits the original text into sentences, filters stop words out of each sentence, and keeps only words of specified parts of speech, obtaining a set of sentences and a set of words. For keyword extraction, each word is a node as in PageRank. With window size k, assume a sentence consists of the words [w_1, w_2, ..., w_n]; then [w_1, w_2, ..., w_k], [w_2, w_3, ..., w_{k+1}], [w_3, w_4, ..., w_{k+2}], and so on are each a window. Between the nodes of any two words in the same window there is an unweighted, undirected edge. On the graph so constructed, the importance of each word node is computed, and the most important words are taken as keywords. TextRank can also extract key phrases: if several keywords are adjacent in the original text, they form a key phrase. For example, in an article introducing "support vector machines", the three keywords "support", "vector", and "machine" may be found, and key-phrase extraction yields "support vector machine". To extract an abstract with TextRank, each sentence is a node in the graph; if two sentences are similar, there is a weighted undirected edge between the corresponding nodes, with the similarity as the weight. The sentences with the highest importance computed by the PageRank algorithm form the abstract. The similarity of two sentences S_i and S_j is computed with the following formula (reconstructed in the standard TextRank form, consistent with the definitions below):

Similarity(S_i, S_j) = |{w_k | w_k ∈ S_i and w_k ∈ S_j}| / (log|S_i| + log|S_j|)

where the numerator is the number of words that occur in both sentences and |S_i| is the number of words in sentence i.
Since the sentence graph is weighted, the PageRank formula is modified to (reconstructed in the standard weighted-TextRank form):

WS(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} [ w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk} ] · WS(V_j)

When computing keywords, all edge weights are equal, so the weights w in numerator and denominator cancel and the TextRank algorithm degenerates to PageRank. The TextRank algorithm inside the textrank4zh module can extract three sentences from the original text to briefly summarize the article.
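A minimal sketch of the abstract-extraction step, combining the sentence-similarity formula with the weighted-TextRank iteration above; the toy sentences and parameter values are assumptions for illustration:

```python
import math

def sentence_similarity(si, sj):
    """Similarity formula from the text: shared-word count / (log|S_i| + log|S_j|)."""
    overlap = len(set(si) & set(sj))
    denom = math.log(len(si)) + math.log(len(sj))
    return overlap / denom if denom > 0 else 0.0

def summarize(sentences, d=0.85, iterations=30, top_k=3):
    """sentences: list of tokenized sentences. Runs weighted TextRank on the
    sentence graph and returns the indices of the top_k most important sentences."""
    n = len(sentences)
    # w[i][j] is the edge weight between sentences i and j (0 on the diagonal)
    w = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    score = [1.0] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            s = 0.0
            for j in range(n):
                out_j = sum(w[j])  # sum of weights leaving node j
                if w[j][i] > 0 and out_j > 0:
                    s += w[j][i] / out_j * score[j]
            new.append((1 - d) + d * s)
        score = new
    return sorted(range(n), key=lambda i: score[i], reverse=True)[:top_k]
```

A sentence sharing words with many others accumulates the most importance and is chosen for the abstract; an isolated sentence keeps only the baseline (1 - d).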
It should be noted that the jieba segmentation module is used in word-vector training, together with Google's Word2vec module, and a function for recognizing neologisms was implemented. Because jieba segments new words of three or more Chinese characters poorly (for example 小游戏 "mini-game", 自动驾驶 "automatic driving", 物联网 "Internet of Things", 区块链 "blockchain"), such neologisms are extracted with a function whose basic principle is: after jieba segmentation, take two adjacent words a and b and form c = a + b, e.g. a = 小 and b = 游戏 combine into c = 小游戏; if c occurs more than a certain threshold number of times, c is treated as a neologism and added to the jieba dictionary.
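The merging rule just described can be sketched as follows; the threshold and example tokens are assumptions, and in a real pipeline the discovered word would be registered with `jieba.add_word`:

```python
from collections import Counter

def find_new_words(token_stream, threshold=3):
    """Merge adjacent token pairs (a, b) into a candidate new word c = a + b
    when the pair occurs more than `threshold` times in the segmented stream,
    as the text describes for words jieba splits apart (e.g. 小 + 游戏 -> 小游戏)."""
    pair_counts = Counter(zip(token_stream, token_stream[1:]))
    return {a + b for (a, b), n in pair_counts.items() if n > threshold}

# In production, each discovered word would then be added to the dictionary:
#   for word in find_new_words(tokens): jieba.add_word(word)
```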
Preferably, in the above article recommendation method based on Chinese similarity computation, the specific steps of Step 2 are:
Step 2.1: definition of word vectors. A machine learning task needs every input quantized into a numerical representation; the computing power of the computer is then fully exploited to calculate the desired result. One representation of word vectors is one-hot:
First, count all the words in the corpus and number each word; for each word, build a V-dimensional vector in which each dimension represents one word, the dimension at the word's number is 1, and all other dimensions are 0.
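A minimal sketch of the one-hot construction described above:

```python
def build_one_hot(corpus_tokens):
    """Number each distinct word, then give each word a V-dimensional vector
    with a 1 at its own index and 0 everywhere else."""
    vocab = sorted(set(corpus_tokens))
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    return {w: [1 if j == index[w] else 0 for j in range(V)] for w in vocab}
```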
Step 2.2: methods of obtaining word vectors include methods based on singular value decomposition and methods based on iteration.
Methods based on singular value decomposition:
a. Word-document matrix
Build a matrix of words and documents; the vector representation of a word is obtained by performing singular value decomposition on this matrix.
b. Word-word matrix
Set a context window and build a co-occurrence matrix between words by counting; word vectors are obtained by performing singular value decomposition on this matrix.
Methods based on iteration are formalized as follows (standard n-gram forms, reconstructed to match the surrounding text):
Unigram language model: assume the probability of each word depends only on the word itself, P(w_1, ..., w_n) = Π_{i=1}^{n} P(w_i);
Bigram language model: assume the probability of the current word depends on the previous word, P(w_1, ..., w_n) = P(w_1) · Π_{i=2}^{n} P(w_i | w_{i-1}).
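The bigram model can be sketched with maximum-likelihood counts; the toy corpus and the absence of smoothing are simplifying assumptions:

```python
from collections import Counter

def bigram_probability(tokens, sentence):
    """Bigram model from the text: P(w_1..w_n) = P(w_1) * prod_i P(w_i | w_{i-1}),
    with probabilities estimated by counting in `tokens` (no smoothing)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    if unigrams[sentence[0]] == 0:
        return 0.0
    p = unigrams[sentence[0]] / total  # P(w_1)
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # P(w_i | w_{i-1})
    return p
```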
a. Continuous Bag of Words Model (CBOW)
Given the context, predict the probability distribution of the target word; first set an objective function, then optimize the neural network by gradient descent.
The objective function uses cross entropy (reconstructed in the standard form):

H(ŷ, y) = - Σ_{j=1}^{V} y_j log(ŷ_j)

Since y is a one-hot representation, the term is non-zero only at the target index i, so the objective becomes

E = - log(ŷ_i)

Substituting the softmax prediction ŷ_i = exp(u_i) / Σ_j exp(u_j), the objective converts to

E = - u_i + log Σ_{j=1}^{V} exp(u_j).
b. Skip-Gram Model
The skip-gram model predicts the probability of the context given the target word. An objective function is set, and then an optimization method is used to find the optimal parameters; the objective (reconstructed in the standard form) is

E = - log Π_{c=1}^{C} p(w_{O,c} | w_I) = - Σ_{c=1}^{C} log p(w_{O,c} | w_I)

Computing the probabilities between the hidden layer and all V words of the output layer is expensive, so to optimize training the model is trained with two methods: hierarchical softmax and negative sampling.
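The collapse of the cross-entropy objective under a one-hot target, as derived above, can be checked numerically; the score vector u below is an arbitrary example, not a trained model's output:

```python
import math

def softmax(u):
    """Numerically stable softmax over a list of scores."""
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_hat, y):
    """H(y_hat, y) = -sum_j y_j * log(y_hat_j)."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj > 0)

u = [2.0, 0.5, -1.0]   # output-layer scores for a vocabulary of V = 3 words
y = [1, 0, 0]          # one-hot target, i = 0
loss = cross_entropy(softmax(u), y)
# With a one-hot target, the objective equals -u_i + log(sum_j exp(u_j))
equivalent = -u[0] + math.log(sum(math.exp(x) for x in u))
```

The two quantities agree, confirming the simplification the text performs before substituting the softmax.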
Preferably, in the above article recommendation method based on Chinese similarity computation, in Step 3 the articles to be recommended are converted into word-vector matrices and the matrix data is standardized, so that each article is represented by a group of vectors, and intermediate file data is generated. Keyword extraction is based on the TF-IDF algorithm: term frequency-inverse document frequency, a common weighting technique in information retrieval and text mining, assesses the importance of a word to one document in a document set or corpus.
Term frequency is the number of times a given word occurs in the document:
TF_w = (number of occurrences of term w in the document) / (total number of terms in the document);
Inverse document frequency: the inverse document frequency of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, then taking the logarithm of the quotient:
IDF = log(total number of documents in the corpus / (number of documents containing term w + 1));
TF-IDF = TF × IDF.
The keywords of each article are obtained in this way, these keywords are then vectorized, and finally the vectors are merged to obtain the vector representation of the article.
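The TF-IDF keyword step can be sketched as follows, using the +1 variant of IDF given above; the toy documents are assumptions for illustration:

```python
import math
from collections import Counter

def tf_idf_keywords(documents, doc_index, top_k=3):
    """documents: list of tokenized documents. For each word w in one document:
    TF_w  = occurrences of w in the document / total terms in the document,
    IDF_w = log(N / (number of documents containing w + 1)),  # +1 as in the text
    score = TF * IDF. Returns the top_k highest-scoring words."""
    n_docs = len(documents)
    df = Counter()  # document frequency of each word
    for doc in documents:
        df.update(set(doc))
    doc = documents[doc_index]
    counts = Counter(doc)
    total = len(doc)
    scores = {w: (c / total) * math.log(n_docs / (df[w] + 1)) for w, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Words frequent in one article but rare across the corpus score highest, which is exactly the weighting property the text motivates.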
Preferably, in the above article recommendation method based on Chinese similarity computation, in Step 4 the user's keyword phrase is converted into a group of matrices, the article matrices obtained in Step 3 are read, and the matrix computation is performed, yielding a column of data that is sorted by similarity coefficient. The standardization formula (reconstructed to match the definitions below) is

v' = v / ‖v‖

that is, the standardized vector has modulus 1 and is a unit vector. The cosine of the angle between two n-dimensional sample points a(x_11, x_12, ..., x_1n) and b(x_21, x_22, ..., x_2n) measures the similarity between a and b, that is:

cos(θ) = Σ_{i=1}^{n} x_1i · x_2i / ( sqrt(Σ_{i=1}^{n} x_1i²) · sqrt(Σ_{i=1}^{n} x_2i²) )

The resulting cos(θ) is the similarity coefficient. With A the matrix formed by the vectors of the articles and B the user's keyword vector, the similarity-coefficient vector is C = A · B; the length of C(c_1, c_2, ..., c_n) is the number of articles, where n is the total number of articles to be recommended and c_1 is the similarity coefficient between article number 1 and the user's keywords. Articles are recommended to the user according to the magnitude of the similarity coefficient between the article vector and the keyword vector entered by the user; articles with larger similarity coefficients are recommended first.
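The standardization and cosine ranking of Step 4 can be sketched as follows; the toy vectors are assumptions for illustration:

```python
import math

def normalize(v):
    """Scale v to unit length, so its modulus is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def recommend(article_vectors, keyword_vector, top_k=2):
    """Rank articles by the cosine of the angle between each article vector
    and the user's keyword vector; after normalization the dot product
    itself is cos(theta), the similarity coefficient."""
    q = normalize(keyword_vector)
    scores = []
    for idx, a in enumerate(article_vectors):
        a = normalize(a)
        scores.append((sum(x * y for x, y in zip(a, q)), idx))
    scores.sort(reverse=True)  # larger coefficient -> recommended first
    return [idx for _, idx in scores[:top_k]]
```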
It can be seen from the above technical scheme that, compared with the prior art, the present disclosure provides an article recommendation method based on Chinese similarity computation that helps Internet users efficiently discover articles of interest, with wide applicability, low manual-labeling cost, and good recommendation diversity. First the main content of articles is crawled with a Python crawler; word vectors are then obtained from the crawled content and trained; the articles to be recommended are then converted into word-vector matrices; finally the user's keyword phrase is converted into a group of matrices, the word-vector matrices of the articles are read, and the matrix computation is performed, yielding a column of data that is sorted by similarity coefficient to recommend articles to the user.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 attached drawing is flow chart of the invention;
Fig. 2 attached drawing is CBOW model structure schematic diagram of the invention;
Fig. 3 attached drawing is skip-gram model structure schematic diagram of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to Figs. 1-3. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The embodiment of the invention discloses an article recommendation method based on Chinese similarity computation that helps Internet users efficiently discover articles of interest, with wide applicability, low manual-labeling cost, and good recommendation diversity.
As shown in Fig. 1, an article recommendation method based on Chinese similarity computation comprises the following steps:
Step 1: crawl the main content of articles with a Python crawler;
Step 2: obtain word vectors from the crawled content and train them;
Step 3: convert the articles to be recommended into word-vector matrices;
Step 4: convert the user's keyword phrase into a matrix, read the word-vector matrices obtained in Step 3, standardize the matrix data, perform the matrix computation, and rank the articles by similarity coefficient.
To further optimize the above technical scheme, the main content crawled in Step 1 comprises: the body text, the head image, and the article abstract. The body text is used to generate the word-vector representation of the article; the head image is shown when the article is recommended to the user; the article abstract is obtained by extracting three sentences from the original text with the TextRank algorithm, summarizing the article's main content.
To further optimize the above technical scheme: the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. The entire World Wide Web is a directed graph whose nodes are web pages; if page A contains a link to page B, there is a directed edge from A to B. After the graph is constructed, the following formula is used (reconstructed in the standard PageRank form):

S(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|

where S(V_i) is the importance of page i; d is a damping coefficient, usually set to 0.85; In(V_i) is the set of pages that link to page i; Out(V_j) is the set of pages that the links in page j point to; and |Out(V_j)| is the number of elements in that set. The importance of a page depends on the sum of the importances contributed by the pages linking to it.
To further optimize the above technical scheme, the TextRank algorithm splits the original text into sentences, filters stop words out of each sentence, and keeps only words of specified parts of speech, obtaining a set of sentences and a set of words. Each word is a node as in PageRank. With window size k, assume a sentence consists of the words [w_1, w_2, ..., w_n]; then [w_1, ..., w_k], [w_2, ..., w_{k+1}], [w_3, ..., w_{k+2}], and so on are each a window. Between the nodes of any two words in the same window there is an unweighted, undirected edge. On the graph so constructed, the importance of each word node is computed, and the most important words are taken as keywords. TextRank can also extract key phrases: if several keywords are adjacent in the original text, they form a key phrase. To extract an abstract with TextRank, each sentence is a node in the graph; if two sentences are similar, there is a weighted undirected edge between the corresponding nodes, with the similarity as the weight. The sentences with the highest importance computed by the PageRank algorithm form the abstract. The similarity of two sentences S_i and S_j is computed with the following formula (reconstructed in the standard TextRank form):

Similarity(S_i, S_j) = |{w_k | w_k ∈ S_i and w_k ∈ S_j}| / (log|S_i| + log|S_j|)

where the numerator is the number of words that occur in both sentences and |S_i| is the number of words in sentence i.
Since the sentence graph is weighted, the PageRank formula is modified to:

WS(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} [ w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk} ] · WS(V_j)

When computing keywords, all edge weights are equal, so the weights w in numerator and denominator cancel and the TextRank algorithm degenerates to PageRank. The TextRank algorithm inside the textrank4zh module can extract three sentences from the original text to briefly summarize the article.
To further optimize the above technical scheme, the specific steps of Step 2 are:
Step 2.1: definition of word vectors. A machine learning task needs every input quantized into a numerical representation; the computing power of the computer is then fully exploited to calculate the desired result. One representation of word vectors is one-hot:
First, count all the words in the corpus and number each word; for each word, build a V-dimensional vector in which each dimension represents one word, the dimension at the word's number is 1, and all other dimensions are 0.
Step 2.2: methods of obtaining word vectors include methods based on singular value decomposition and methods based on iteration.
Methods based on singular value decomposition:
a. Word-document matrix
Build a matrix of words and documents; the vector representation of a word is obtained by performing singular value decomposition on this matrix.
b. Word-word matrix
Set a context window and build a co-occurrence matrix between words by counting; word vectors are obtained by performing singular value decomposition on this matrix.
Methods based on iteration are formalized as follows: a unigram language model assumes the probability of each word depends only on the word itself, P(w_1, ..., w_n) = Π_{i=1}^{n} P(w_i); a bigram language model assumes the probability of the current word depends on the previous word, P(w_1, ..., w_n) = P(w_1) · Π_{i=2}^{n} P(w_i | w_{i-1}).
a. Continuous Bag of Words Model (CBOW)
As shown in Fig. 2, given the context, predict the probability distribution of the target word; first set an objective function, then optimize the neural network by gradient descent.
The objective function uses cross entropy: H(ŷ, y) = -Σ_j y_j log(ŷ_j). Since y is a one-hot representation, the term is non-zero only at the target index i, so the objective becomes E = -log(ŷ_i); substituting the softmax prediction, it converts to E = -u_i + log Σ_j exp(u_j).
b. Skip-Gram Model
As shown in Fig. 3, the skip-gram model predicts the probability of the context given the target word. An objective function is set and an optimization method is used to find the optimal parameters: E = -Σ_{c=1}^{C} log p(w_{O,c} | w_I). Computing the probabilities between the hidden layer and all V words of the output layer is expensive, so to reduce training time the model is trained with two methods: hierarchical softmax and negative sampling.
In order to further optimize the above technical scheme, term vector matrix is converted by article to be recommended in step 3, and right
Matrix data is standardized, and an article is indicated with one group of vector, and generate intermediate file data, and TF-IDF is based on
Algorithm carries out keyword extraction:
The common weighting technique that word frequency-inverse file frequency is prospected for information retrieval and information;A words is assessed for one
The significance level of a file set or a copy of it file in a corpus;
Word frequency is the number that some given word occurs in this document: TFw=entry w in certain one kind occurs
Number/such in all entry number;Some general words for theme there is no too big effect, though it is some go out
The less word of existing frequency can express the theme of article, so simple use is TF inappropriate.The design of weight must expire
Foot: the ability of a word prediction theme is stronger, and weight is bigger, conversely, weight is smaller.In the article of all statistics, some words are only
It is to occur in wherein seldom several articles, then such word is very big to the effect of the theme of article, the weight of these words is answered
The design it is larger.
Reverse document-frequency: the reverse document-frequency of a certain particular words, by general act number divided by comprising the word it
The number of file, then take logarithm to obtain formula obtained quotient:
IDF=log (total number of documents of the corpus/number of files+1 comprising entry w);Why denominator will add 1, be in order to
Avoiding denominator is the low document-frequency of high term frequencies and the word in entire file set in 0 a certain specific file,
It can produce out the TF-IDF of high weight.
TF-IDF=TF*IDF;
The keywords of each article are obtained, these keywords are vectorized, and the resulting vectors are finally merged to obtain the vector representation of the article.
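The TF-IDF keyword step described above can be sketched in Python (a minimal illustration with hypothetical names such as `tf_idf_keywords`; it assumes documents are already tokenized into word lists and uses the IDF formula above with the +1 in the denominator):

```python
import math
from collections import Counter

def tf_idf_keywords(docs, top_k=3):
    """Rank the words of each tokenized document by TF-IDF.

    TF  = count of word w in the doc / total words in the doc
    IDF = log(total docs / (docs containing w + 1))  # +1 avoids a zero denominator
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores = {w: (c / total) * math.log(n_docs / (df[w] + 1))
                  for w, c in tf.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords.append(ranked[:top_k])
    return keywords

docs = [["word", "vector", "article", "article"],
        ["article", "matrix", "similarity"],
        ["vector", "matrix", "matrix", "keyword"]]
print(tf_idf_keywords(docs, top_k=2))
```

Note that with the +1 denominator a word occurring in every document gets IDF = log(N/(N+1)) < 0, so ubiquitous words sink to the bottom of each ranking.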
To further optimize the above technical scheme, in step 4 the user's keyword phrase is converted into a group of matrices, the article matrix obtained in step 3 is read, and a matrix calculation is performed to obtain a column of data, which is sorted by similarity coefficient; the standardization formula is as follows:
Each n-dimensional vector is standardized so that its modulus is 1, i.e. it becomes a unit vector; the cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n) measures the similarity between a and b, that is:
cos(θ) = (Σk x1k·x2k) / (√(Σk x1k²) · √(Σk x2k²))
The resulting cos(θ) is the similarity coefficient. Let A be the matrix composed of the article vectors and B the user's keyword vector; the similarity coefficient vector is C = A·B, where the length of C(c1, c2, …, cn) equals the number of articles, n is the total number of articles to be recommended, and c1 is the similarity coefficient between article number 1 and the user's keywords. Articles are recommended to the user according to the size of the similarity coefficient between the article vectors and the keyword vector the user entered; the larger the similarity coefficient, the higher the article's recommendation priority.
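The standardization and cosine-ranking step can be sketched as follows (a minimal illustration with hypothetical helper names; the toy 3-dimensional vectors stand in for real article and keyword vectors):

```python
import math

def normalize(v):
    """Scale v to unit length (modulus 1), per the standardization step."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """cos(theta) of the angle between a and b, i.e. the similarity coefficient."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def recommend(article_vectors, keyword_vector, top_n=2):
    """Rank articles by similarity to the user's keyword vector."""
    scored = [(i, cosine(v, keyword_vector)) for i, v in enumerate(article_vectors)]
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:top_n]

articles = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 1.0, 0.0]
print(recommend(articles, query))  # article 1 ranks first, then article 0
```

Because every vector is normalized first, the dot product equals the cosine of the included angle, so sorting by it orders the articles by similarity coefficient.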
Each embodiment in this specification is described in a progressive manner; each embodiment highlights its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively simple, and the relevant points can be found in the description of the method.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. An article recommendation method based on Chinese similarity measures, characterized in that the specific steps include:
Step 1: crawl the main contents of articles using a Python crawler;
Step 2: obtain word vectors from the crawled article contents and train them;
Step 3: convert the articles to be recommended into a word-vector matrix;
Step 4: convert the user's keyword phrase into a matrix, then read the word-vector matrix obtained from the article conversion in step 3, standardize the word-vector matrix data, perform the matrix calculation, and sort by similarity coefficient.
2. The article recommendation method based on Chinese similarity measures according to claim 1, characterized in that the main contents crawled in step 1 specifically include: the text content, the head image, and the article abstract; the text content is used to generate the word-vector representation of the article; the head image is displayed with the article recommended to the user; the article abstract uses the TextRank algorithm to extract three sentences from the original text, so as to summarize the main contents of the article.
3. The article recommendation method based on Chinese similarity measures according to claim 2, characterized in that the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. The entire World Wide Web is treated as a directed graph whose nodes are webpages; if webpage A contains a link to webpage B, there is a directed edge from webpage A to webpage B. After the graph is constructed, the following formula is used:
S(Vi) = (1 - d) + d · Σ over Vj in In(Vi) of S(Vj) / |Out(Vj)|
where S(Vi) is the importance of webpage i; d is the damping coefficient, usually set to 0.85; In(Vi) is the set of webpages with links pointing to webpage i; Out(Vj) is the set of webpages pointed to by the links in webpage j; and |Out(Vj)| is the number of elements in that set. The importance of a webpage depends on the sum of the importance contributed by each webpage linking to it.
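The webpage-importance formula above can be illustrated with a short iterative sketch (hypothetical `pagerank` helper; a tiny three-page web stands in for the WWW graph):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate S(Vi) = (1 - d) + d * sum over Vj in In(Vi) of S(Vj)/|Out(Vj)|.

    links maps each page to the list of pages it links to.
    """
    nodes = set(links)
    for outs in links.values():
        nodes.update(outs)
    score = {n: 1.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 1.0 - d for n in nodes}       # the (1 - d) base term
        for src, outs in links.items():
            if not outs:
                continue
            share = score[src] / len(outs)      # S(Vj) / |Out(Vj)|
            for dst in outs:
                nxt[dst] += d * share
        score = nxt
    return score

# A toy web: A and C both link to B, B links back to A.
web = {"A": ["B"], "B": ["A"], "C": ["B"]}
scores = pagerank(web)
print(max(scores, key=scores.get))  # B collects the most incoming importance
```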
4. The article recommendation method based on Chinese similarity measures according to claim 3, characterized in that the TextRank algorithm splits the original text into sentences, filters out stop words in each sentence, and retains only words of specified parts of speech, yielding a set of sentences and a set of words. Each word is a node, as in PageRank. Set the window size to k; assuming a sentence consists of the words w1, w2, …, then [w1, w2, …, wk], [w2, w3, …, wk+1], [w3, w4, …, wk+2], and so on are each a window. An unweighted, undirected edge exists between the nodes corresponding to any two words in the same window. Based on the graph constructed above, the importance of each word node is calculated, and the most important words are taken as keywords. TextRank can also extract key phrases: if several keywords are adjacent in the original text, those keywords constitute a key phrase.
To extract an abstract with TextRank, each sentence is regarded as a node in the graph; if two sentences are similar, a weighted undirected edge is considered to exist between the two corresponding nodes, whose weight is their similarity. The sentences with the highest importance calculated by the PageRank algorithm are taken as the abstract. The similarity of two sentences Si and Sj is calculated using the following formula:
Similarity(Si, Sj) = |{wk | wk ∈ Si and wk ∈ Sj}| / (log|Si| + log|Sj|)
where |{wk | wk ∈ Si and wk ∈ Sj}| is the number of words that appear in both sentences and |Si| is the number of words in sentence i.
Since the graph is weighted, the PageRank formula is modified as:
WS(Vi) = (1 - d) + d · Σ over Vj in In(Vi) of [w(j,i) / Σ over Vk in Out(Vj) of w(j,k)] · WS(Vj)
When calculating keywords, a single word is regarded as a sentence; the weights of all the resulting edges are then equal, the weights w in the numerator and denominator cancel, and the TextRank algorithm degenerates into PageRank. Using the TextRank algorithm inside the textrank4zh module, three sentences can be extracted from the original text to briefly summarize the article.
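The sentence-graph variant can be sketched as follows (illustrative code, not the textrank4zh implementation; it combines the sentence-similarity formula with the weighted PageRank update, on toy tokenized sentences):

```python
import math

def sentence_similarity(si, sj):
    """|{w : w in Si and w in Sj}| / (log|Si| + log|Sj|)."""
    overlap = len(set(si) & set(sj))
    denom = math.log(len(si)) + math.log(len(sj))
    return overlap / denom if denom > 0 else 0.0

def textrank_summary(sentences, d=0.85, iters=50, top_n=1):
    """Score sentences on the weighted graph and keep the highest-ranked ones."""
    n = len(sentences)
    w = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    score = [1.0] * n
    for _ in range(iters):
        out_sum = [sum(w[j]) for j in range(n)]
        score = [(1 - d) + d * sum(w[j][i] * score[j] / out_sum[j]
                                   for j in range(n) if out_sum[j] > 0)
                 for i in range(n)]
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    return ranked[:top_n]

sents = [["matrix", "vector", "article"],
         ["matrix", "vector", "keyword"],
         ["weather", "report", "today"]]
print(textrank_summary(sents))
```

The off-topic third sentence shares no words with the others, so its score stays at the (1 - d) floor and one of the two related sentences is selected as the summary.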
5. The article recommendation method based on Chinese similarity measures according to claim 1, characterized in that the specific steps of step 2 include:
Step 2.1: definition of word vectors. A machine-learning task requires any input to be quantized into a numerical representation so that the computing power of the computer can be fully used to calculate the desired result; one representation of word vectors is the one-hot representation:
First, count all the vocabulary in the corpus and number each word; then establish a V-dimensional vector for each word, where each dimension of the vector represents a word: the value at the dimension matching the word's number is 1, and all other dimensions are 0.
Step 2.2: methods of obtaining word vectors include methods based on singular value decomposition and methods based on iteration. The methods based on singular value decomposition:
a. word-document matrix: establish a matrix of words and documents; by performing singular value decomposition on this matrix, the vector representation of each word is obtained;
b. word-word matrix: set a context window and count word co-occurrences to establish a word-word co-occurrence matrix; word vectors are obtained by performing singular value decomposition on this matrix.
The methods based on iteration have the following formalization:
a unigram language model assumes that the probability of each word depends only on the word itself: P(w1, w2, …, wn) = Π P(wi);
a bigram language model assumes that the probability of the current word depends on the previous word: P(w1, w2, …, wn) = Π P(wi | wi-1);
a. Continuous Bag of Words Model
Given the context, predict the probability distribution of the target word: first set an objective function, then optimize the neural network by the gradient descent method.
The objective function uses the cross-entropy function:
H(ŷ, y) = -Σj yj · log(ŷj)
Since y is a one-hot representation, only the term for the index i with yi = 1 is nonzero, and the objective function becomes:
H(ŷ, y) = -log(ŷi)
Substituting the formula for the predicted value ŷi (the softmax output), the objective function can be converted into:
H = -ui · v̂ + log Σj exp(uj · v̂)
where v̂ is the context vector and ui is the output vector of the target word.
b. Skip-Gram Model
The skip-gram model predicts the probability of the context given the target word: set an objective function, then use an optimization method to find the optimal parameter solution. The objective function is as follows:
J = -Σ over -m ≤ j ≤ m, j ≠ 0 of log P(w(c+j) | w(c))
where m is the window size and w(c) is the center word. Since the probabilities between the hidden layer and all V words of the output layer must be calculated, the two methods Hierarchical softmax and Negative sampling are used to optimize the training of the model.
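The one-hot representation of step 2.1 and the word-word co-occurrence matrix of step 2.2 can be sketched as follows (illustrative helpers; a real system would factorize the co-occurrence matrix with SVD, or train word2vec, rather than use the raw counts):

```python
def build_vocab(corpus):
    """Number every distinct word, as in the one-hot representation step."""
    vocab = {}
    for sent in corpus:
        for w in sent:
            vocab.setdefault(w, len(vocab))
    return vocab

def one_hot(word, vocab):
    """V-dimensional vector: 1 at the word's numbered position, 0 elsewhere."""
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v

def cooccurrence(corpus, vocab, window=2):
    """Word-word co-occurrence counts within a context window (the matrix
    that the SVD-based method would factorize to obtain word vectors)."""
    V = len(vocab)
    m = [[0] * V for _ in range(V)]
    for sent in corpus:
        for i, w in enumerate(sent):
            for c in sent[max(0, i - window): i + window + 1]:
                if c != w:
                    m[vocab[w]][vocab[c]] += 1
    return m

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus)
print(one_hot("cat", vocab))
print(cooccurrence(corpus, vocab)[vocab["the"]])
```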
6. The article recommendation method based on Chinese similarity measures according to claim 1, characterized in that in step 3 the articles to be recommended are converted into a word-vector matrix and the matrix data is standardized, each article is represented by a group of vectors, an intermediate data file is generated, and keyword extraction is performed based on the TF-IDF algorithm:
TF-IDF (term frequency-inverse document frequency) is a common weighting technique in information retrieval and text mining; it assesses the importance of a word to a document within a file set or corpus.
Term frequency (TF) is the number of times a given word appears in a document: TFw = (number of occurrences of term w in the document) / (total number of terms in the document).
Inverse document frequency (IDF): the inverse document frequency of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and then taking the logarithm of the quotient:
IDF = log(total number of documents in the corpus / (number of documents containing term w + 1));
TF-IDF = TF * IDF;
The keywords of each article are obtained, these keywords are vectorized, and the resulting vectors are finally merged to obtain the vector representation of the article.
7. The article recommendation method based on Chinese similarity measures according to claim 1, characterized in that in step 4 the user's keyword phrase is converted into a group of matrices, the article matrix obtained in step 3 is read, and a matrix calculation is performed to obtain a column of data, which is sorted by similarity coefficient; the standardization formula is as follows:
Each n-dimensional vector is standardized so that its modulus is 1, i.e. it becomes a unit vector; the cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n) measures the similarity between a and b, that is:
cos(θ) = (Σk x1k·x2k) / (√(Σk x1k²) · √(Σk x2k²))
The resulting cos(θ) is the similarity coefficient. Let A be the matrix composed of the article vectors and B the user's keyword vector; the similarity coefficient vector is C = A·B, where the length of C(c1, c2, …, cn) equals the number of articles, n is the total number of articles to be recommended, and c1 is the similarity coefficient between article number 1 and the user's keywords. Articles are recommended to the user according to the size of the similarity coefficient between the article vectors and the keyword vector the user entered; the larger the similarity coefficient, the higher the article's recommendation priority.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810701560.8A CN110020189A (en) | 2018-06-29 | 2018-06-29 | A kind of article recommended method based on Chinese Similarity measures |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810701560.8A CN110020189A (en) | 2018-06-29 | 2018-06-29 | A kind of article recommended method based on Chinese Similarity measures |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110020189A true CN110020189A (en) | 2019-07-16 |
Family
ID=67188323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810701560.8A Pending CN110020189A (en) | 2018-06-29 | 2018-06-29 | A kind of article recommended method based on Chinese Similarity measures |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020189A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110597981A (en) * | 2019-09-16 | 2019-12-20 | 西华大学 | Network news summary system for automatically generating summary by adopting multiple strategies |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN110633363A (en) * | 2019-09-18 | 2019-12-31 | 桂林电子科技大学 | Text entity recommendation method based on NLP and fuzzy multi-criterion decision |
CN110851570A (en) * | 2019-11-14 | 2020-02-28 | 中山大学 | Unsupervised keyword extraction method based on Embedding technology |
CN111061957A (en) * | 2019-12-26 | 2020-04-24 | 广东电网有限责任公司 | Article similarity recommendation method and device |
CN111178059A (en) * | 2019-12-07 | 2020-05-19 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN111651588A (en) * | 2020-06-10 | 2020-09-11 | 扬州大学 | Article abstract information extraction algorithm based on directed graph |
CN111753151A (en) * | 2020-06-24 | 2020-10-09 | 广东科杰通信息科技有限公司 | Service recommendation method based on internet user behaviors |
CN112000867A (en) * | 2020-08-17 | 2020-11-27 | 桂林电子科技大学 | Text classification method based on social media platform |
CN112686026A (en) * | 2021-03-17 | 2021-04-20 | 平安科技(深圳)有限公司 | Keyword extraction method, device, equipment and medium based on information entropy |
TWI727624B (en) * | 2020-01-21 | 2021-05-11 | 兆豐國際商業銀行股份有限公司 | News filtering device and news filtering method |
CN112948568A (en) * | 2019-12-10 | 2021-06-11 | 武汉渔见晚科技有限责任公司 | Content recommendation method and device based on text concept network |
CN112949287A (en) * | 2021-01-13 | 2021-06-11 | 平安科技(深圳)有限公司 | Hot word mining method, system, computer device and storage medium |
CN113554053A (en) * | 2021-05-20 | 2021-10-26 | 重庆康洲大数据有限公司 | Method for comparing similarity of traditional Chinese medicine prescriptions |
CN113742602A (en) * | 2020-05-29 | 2021-12-03 | 中国电信股份有限公司 | Method, apparatus, and computer-readable storage medium for sample optimization |
CN113761323A (en) * | 2020-06-01 | 2021-12-07 | 深圳华大基因科技有限公司 | Document recommendation system and document recommendation method |
TWI749901B (en) * | 2020-11-25 | 2021-12-11 | 重量科技股份有限公司 | Method for forming key information and computer system for the same |
CN114254851A (en) * | 2020-09-24 | 2022-03-29 | Ncr公司 | Commodity similarity handling |
CN114943224A (en) * | 2022-05-07 | 2022-08-26 | 新智道枢(上海)科技有限公司 | Word vector-based alert text keyword extraction method, system, medium, and device |
CN117610543A (en) * | 2023-11-08 | 2024-02-27 | 华南理工大学 | Chinese character and structure association analysis method, medium and equipment based on graph network |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008243024A (en) * | 2007-03-28 | 2008-10-09 | Kyushu Institute Of Technology | Information acquisition device, program therefor and method |
CN101980196A (en) * | 2010-10-25 | 2011-02-23 | 中国农业大学 | Article comparison method and device |
CN103927358A (en) * | 2014-04-15 | 2014-07-16 | 清华大学 | Text search method and system |
CN105183833A (en) * | 2015-08-31 | 2015-12-23 | 天津大学 | User model based microblogging text recommendation method and recommendation apparatus thereof |
CN106227722A (en) * | 2016-09-12 | 2016-12-14 | 中山大学 | A kind of extraction method based on listed company's bulletin summary |
CN106484664A (en) * | 2016-10-21 | 2017-03-08 | 竹间智能科技(上海)有限公司 | Similarity calculating method between a kind of short text |
US20170139899A1 (en) * | 2015-11-18 | 2017-05-18 | Le Holdings (Beijing) Co., Ltd. | Keyword extraction method and electronic device |
CN106815297A (en) * | 2016-12-09 | 2017-06-09 | 宁波大学 | A kind of academic resources recommendation service system and method |
CN107133315A (en) * | 2017-05-03 | 2017-09-05 | 有米科技股份有限公司 | A kind of smart media based on semantic analysis recommends method |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
CN107832306A (en) * | 2017-11-28 | 2018-03-23 | 武汉大学 | A kind of similar entities method for digging based on Doc2vec |
2018-06-29: CN CN201810701560.8A patent/CN110020189A/en active Pending
Non-Patent Citations (2)
Title |
---|
芮伟康: "基于语义的文本向量表示方法研究", 《中国优秀硕士论文全文数据库_信息科技辑》, 15 January 2018 (2018-01-15), pages 7 - 35 *
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN110597981A (en) * | 2019-09-16 | 2019-12-20 | 西华大学 | Network news summary system for automatically generating summary by adopting multiple strategies |
CN110633363A (en) * | 2019-09-18 | 2019-12-31 | 桂林电子科技大学 | Text entity recommendation method based on NLP and fuzzy multi-criterion decision |
CN110851570B (en) * | 2019-11-14 | 2023-04-18 | 中山大学 | Unsupervised keyword extraction method based on Embedding technology |
CN110851570A (en) * | 2019-11-14 | 2020-02-28 | 中山大学 | Unsupervised keyword extraction method based on Embedding technology |
CN111178059A (en) * | 2019-12-07 | 2020-05-19 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN111178059B (en) * | 2019-12-07 | 2023-08-25 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN112948568B (en) * | 2019-12-10 | 2022-08-30 | 武汉渔见晚科技有限责任公司 | Content recommendation method and device based on text concept network |
CN112948568A (en) * | 2019-12-10 | 2021-06-11 | 武汉渔见晚科技有限责任公司 | Content recommendation method and device based on text concept network |
CN111061957A (en) * | 2019-12-26 | 2020-04-24 | 广东电网有限责任公司 | Article similarity recommendation method and device |
TWI727624B (en) * | 2020-01-21 | 2021-05-11 | 兆豐國際商業銀行股份有限公司 | News filtering device and news filtering method |
CN113742602A (en) * | 2020-05-29 | 2021-12-03 | 中国电信股份有限公司 | Method, apparatus, and computer-readable storage medium for sample optimization |
CN113761323A (en) * | 2020-06-01 | 2021-12-07 | 深圳华大基因科技有限公司 | Document recommendation system and document recommendation method |
CN111651588B (en) * | 2020-06-10 | 2024-03-05 | 扬州大学 | Article abstract information extraction algorithm based on directed graph |
CN111651588A (en) * | 2020-06-10 | 2020-09-11 | 扬州大学 | Article abstract information extraction algorithm based on directed graph |
CN111753151B (en) * | 2020-06-24 | 2023-09-15 | 广东科杰通信息科技有限公司 | Service recommendation method based on Internet user behavior |
CN111753151A (en) * | 2020-06-24 | 2020-10-09 | 广东科杰通信息科技有限公司 | Service recommendation method based on internet user behaviors |
CN112000867A (en) * | 2020-08-17 | 2020-11-27 | 桂林电子科技大学 | Text classification method based on social media platform |
CN114254851A (en) * | 2020-09-24 | 2022-03-29 | Ncr公司 | Commodity similarity handling |
TWI749901B (en) * | 2020-11-25 | 2021-12-11 | 重量科技股份有限公司 | Method for forming key information and computer system for the same |
CN112949287A (en) * | 2021-01-13 | 2021-06-11 | 平安科技(深圳)有限公司 | Hot word mining method, system, computer device and storage medium |
CN112949287B (en) * | 2021-01-13 | 2023-06-27 | 平安科技(深圳)有限公司 | Hot word mining method, system, computer equipment and storage medium |
CN112686026A (en) * | 2021-03-17 | 2021-04-20 | 平安科技(深圳)有限公司 | Keyword extraction method, device, equipment and medium based on information entropy |
CN113554053A (en) * | 2021-05-20 | 2021-10-26 | 重庆康洲大数据有限公司 | Method for comparing similarity of traditional Chinese medicine prescriptions |
CN114943224A (en) * | 2022-05-07 | 2022-08-26 | 新智道枢(上海)科技有限公司 | Word vector-based alert text keyword extraction method, system, medium, and device |
CN117610543A (en) * | 2023-11-08 | 2024-02-27 | 华南理工大学 | Chinese character and structure association analysis method, medium and equipment based on graph network |
CN117610543B (en) * | 2023-11-08 | 2024-08-02 | 华南理工大学 | Chinese character and structure association analysis method, medium and equipment based on graph network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110020189A (en) | A kind of article recommended method based on Chinese Similarity measures | |
Wang et al. | Multilayer dense attention model for image caption | |
CN106844658B (en) | Automatic construction method and system of Chinese text knowledge graph | |
Thakkar et al. | Graph-based algorithms for text summarization | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
CN108052593A (en) | A kind of subject key words extracting method based on descriptor vector sum network structure | |
CN109960786A (en) | Chinese Measurement of word similarity based on convergence strategy | |
CN103207860B (en) | The entity relation extraction method and apparatus of public sentiment event | |
CN107122413A (en) | A kind of keyword extracting method and device based on graph model | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN107247780A (en) | A kind of patent document method for measuring similarity of knowledge based body | |
CN110674252A (en) | High-precision semantic search system for judicial domain | |
CN110879834B (en) | Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof | |
CN102622338A (en) | Computer-assisted computing method of semantic distance between short texts | |
CN106484797A (en) | Accident summary abstracting method based on sparse study | |
CN103473280A (en) | Method and device for mining comparable network language materials | |
CN112256861B (en) | Rumor detection method based on search engine return result and electronic device | |
CN114997288B (en) | Design resource association method | |
CN110888991A (en) | Sectional semantic annotation method in weak annotation environment | |
Sadr et al. | Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms | |
CN109635107A (en) | The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source | |
CN114912449B (en) | Technical feature keyword extraction method and system based on code description text | |
CN112417170B (en) | Relationship linking method for incomplete knowledge graph | |
CN111984782A (en) | Method and system for generating text abstract of Tibetan language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190716 |