CN110020189A - An article recommendation method based on Chinese similarity computation - Google Patents
An article recommendation method based on Chinese similarity computation
- Publication number: CN110020189A
- Application number: CN201810701560.8A
- Authority
- CN
- China
- Prior art keywords
- article
- word
- matrix
- vector
- webpage
- Prior art date
- Legal status (an assumption based on the published record, not a legal conclusion)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses an article recommendation method based on Chinese similarity computation, comprising the following steps: crawling the main content of articles with a Python crawler; obtaining word vectors from the crawled content and training them; converting the articles to be recommended into word-vector matrices; converting the user's keyword phrase into a matrix, reading the word-vector matrices obtained in the previous step, standardizing the matrix data, performing the matrix computation, and ranking the articles by similarity coefficient. The method helps Internet users efficiently discover articles of interest, has a wide scope of application, a low manual-labeling cost, and good recommendation diversity.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to an article recommendation method based on Chinese similarity computation.
Background art
With the continuous development of the Internet, people's living habits and lifestyles are undergoing revolutionary change. The Internet has not only made life more convenient but has also greatly widened the channels through which people obtain information. According to the 36th Statistical Report on Internet Development in China released by the China Internet Network Information Center (CNNIC), by June 2015 China had 555 million Internet news users, of whom 460 million read news on mobile phones; as an important information-acquisition application, Internet news ranked second in usage, behind only instant messaging.
Against the background of big data, search engines such as Google and Baidu let users find the information they need by entering keywords. But if a user cannot accurately describe that need as keywords, a search engine is of little help. Unlike a search engine, a recommender system analyzes a user's behavior or the features of items in order to discover content the user is interested in. With the development and growth of major news-publishing platforms (such as WeChat official accounts), the number of articles grows rapidly and finding articles of interest becomes ever harder: the flood of articles brings users a wealth of information but also great difficulty of choice. Helping users efficiently discover articles of interest has become an urgent problem for information-publishing platforms.
Because sufficient information about user interests is lacking, and because processing articles is itself challenging, automatic article recommendation on the Internet has had limited effect, and similar-article recommendation algorithms still have much room for improvement. Such algorithms must use natural language processing to cope with the semantic ambiguity, syntactic ambiguity, non-standard grammar, and inconsistent wording of natural language, convert natural language into mathematical symbols a machine can recognize, and then model and verify by means of machine learning and data mining. A large body of research on similar-article recommendation already exists, such as recommendation based on clustering and classification, recommendation based on keywords, and recommendation of popular articles in specific domains. Although this work achieves some effect in certain scenarios, problems such as high complexity, narrow applicability, high manual-labeling cost, and poor recommendation diversity have limited its application.
Therefore, how to provide an article recommendation method based on Chinese similarity computation that helps Internet users efficiently discover articles of interest, with wide applicability, low manual-labeling cost, and good recommendation diversity, is a problem urgently awaiting solution by those skilled in the art.
Summary of the invention
In view of this, the present invention provides an article recommendation method based on Chinese similarity computation that helps Internet users efficiently discover articles of interest, with wide applicability, low manual-labeling cost, and good recommendation diversity.
To achieve the above goals, the invention provides the following technical scheme:
An article recommendation method based on Chinese similarity computation, comprising the following steps:
Step 1: crawl the main content of articles with a Python crawler;
Step 2: obtain word vectors from the crawled content and train them;
Step 3: convert the articles to be recommended into word-vector matrices;
Step 4: convert the user's keyword phrase into a matrix, read the word-vector matrices obtained in Step 3, standardize the matrix data, perform the matrix computation, and rank the articles by similarity coefficient.
Through the above technical scheme, the technical effect of the invention is: according to the user's points of interest, the most relevant articles are recommended. The core of the algorithm is Chinese similarity computation; it helps Internet users efficiently discover articles of interest, has wide applicability, low manual-labeling cost, and good recommendation diversity.
Preferably, in the above article recommendation method based on Chinese similarity computation, the main content crawled in Step 1 comprises: the body text, the head image, and the article abstract. The body text is used to generate the word-vector representation of the article; the head image is shown when the article is recommended to the user; the article abstract is obtained by extracting three sentences from the original text with the TextRank algorithm, summarizing the article's main content.
Further, the Python modules mainly used for crawling article content are requests, BeautifulSoup, and TextRank4Sentence.
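The extraction step above can be sketched without the third-party libraries the text names (requests, BeautifulSoup, TextRank4Sentence), using only Python's standard html.parser; this is a minimal illustration of pulling the body text and head image out of a fetched page under those simplifying assumptions, not the patent's implementation:

```python
from html.parser import HTMLParser

class ArticleTextParser(HTMLParser):
    """Collects text inside <p> tags and the src of the first <img> (head image)."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []
        self.head_image = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
        elif tag == "img" and self.head_image is None:
            self.head_image = dict(attrs).get("src")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.paragraphs.append(data.strip())

def crawl_article(html):
    """Return the body text and head image of one article page."""
    parser = ArticleTextParser()
    parser.feed(html)
    return {"text": " ".join(parser.paragraphs), "head_image": parser.head_image}
```

In a real crawler the HTML string would come from a `requests.get(...)` call, and the abstract would be produced by the TextRank step described below.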
Preferably, in the above article recommendation method based on Chinese similarity computation, the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. In PageRank, the entire World Wide Web is a directed graph whose nodes are web pages; if page A contains a link to page B, there is a directed edge from A to B. After the graph is constructed, the following formula is used (reconstructed here in the standard PageRank form, consistent with the symbol definitions below):

S(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|

where S(V_i) is the importance (PR value) of page i; d is a damping coefficient, usually set to 0.85; In(V_i) is the set of pages that link to page i; Out(V_j) is the set of pages that the links in page j point to; and |Out(V_j)| is the number of elements in that set. The importance of a page thus depends on the sum of the importances contributed by the pages linking to it.
Through the above technical scheme, the beneficial effect is: each page V_j linking to page i distributes its own importance S(V_j) over all the pages it links to, which is why S(V_j) is divided by |Out(V_j)|. Meanwhile, a page's importance is not determined solely by the pages linking to it; with a certain probability it receives a baseline value, which is the role of d. PageRank must iterate the above formula many times before the result converges; initially the importance of every page is set to 1. The left side of the formula gives the PR value of page i after an iteration, while the PR values used on the right side are those from before the iteration.
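The iteration just described can be sketched as follows; the example graph, damping value, and iteration count are illustrative assumptions, not values from the patent:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to.
    Iterates S(V_i) = (1 - d) + d * sum_{j in In(V_i)} S(V_j) / |Out(V_j)|,
    starting from an importance of 1 for every page, as the text describes."""
    nodes = set(links)
    for outs in links.values():
        nodes.update(outs)
    score = {n: 1.0 for n in nodes}
    # incoming[i] = In(V_i), the pages that link to i
    incoming = {n: [] for n in nodes}
    for j, outs in links.items():
        for i in outs:
            incoming[i].append(j)
    for _ in range(iterations):
        new = {}
        for i in nodes:
            new[i] = (1 - d) + d * sum(score[j] / len(links[j]) for j in incoming[i])
        score = new  # right side always uses the pre-iteration values
    return score
```

On a small graph, pages with more (or more important) in-links end up with higher PR values, matching the intuition in the text.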
Preferably, in the above article recommendation method based on Chinese similarity computation, the TextRank algorithm splits the original text into sentences, filters stop words out of each sentence, and keeps only words of specified parts of speech, obtaining a set of sentences and a set of words. For keyword extraction, each word is a node as in PageRank. With window size k, assume a sentence consists of the words [w_1, w_2, ..., w_n]; then [w_1, w_2, ..., w_k], [w_2, w_3, ..., w_{k+1}], [w_3, w_4, ..., w_{k+2}], and so on are each a window. Between the nodes of any two words in the same window there is an unweighted, undirected edge. On the graph so constructed, the importance of each word node is computed, and the most important words are taken as keywords. TextRank can also extract key phrases: if several keywords are adjacent in the original text, they form a key phrase. For example, in an article introducing "support vector machines", the three keywords "support", "vector", and "machine" may be found, and key-phrase extraction yields "support vector machine". To extract an abstract with TextRank, each sentence is a node in the graph; if two sentences are similar, there is a weighted undirected edge between the corresponding nodes, with the similarity as the weight. The sentences with the highest importance computed by the PageRank algorithm form the abstract. The similarity of two sentences S_i and S_j is computed with the following formula (reconstructed in the standard TextRank form, consistent with the definitions below):

Similarity(S_i, S_j) = |{w_k | w_k ∈ S_i and w_k ∈ S_j}| / (log|S_i| + log|S_j|)

where the numerator is the number of words that occur in both sentences and |S_i| is the number of words in sentence i.
Since the sentence graph is weighted, the PageRank formula is modified to (reconstructed in the standard weighted-TextRank form):

WS(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} [ w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk} ] · WS(V_j)

When computing keywords, all edge weights are equal, so the weights w in numerator and denominator cancel and the TextRank algorithm degenerates to PageRank. The TextRank algorithm inside the textrank4zh module can extract three sentences from the original text to briefly summarize the article.
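A minimal sketch of the abstract-extraction step, combining the sentence-similarity formula with the weighted-TextRank iteration above; the toy sentences and parameter values are assumptions for illustration:

```python
import math

def sentence_similarity(si, sj):
    """Similarity formula from the text: shared-word count / (log|S_i| + log|S_j|)."""
    overlap = len(set(si) & set(sj))
    denom = math.log(len(si)) + math.log(len(sj))
    return overlap / denom if denom > 0 else 0.0

def summarize(sentences, d=0.85, iterations=30, top_k=3):
    """sentences: list of tokenized sentences. Runs weighted TextRank on the
    sentence graph and returns the indices of the top_k most important sentences."""
    n = len(sentences)
    # w[i][j] is the edge weight between sentences i and j (0 on the diagonal)
    w = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    score = [1.0] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            s = 0.0
            for j in range(n):
                out_j = sum(w[j])  # sum of weights leaving node j
                if w[j][i] > 0 and out_j > 0:
                    s += w[j][i] / out_j * score[j]
            new.append((1 - d) + d * s)
        score = new
    return sorted(range(n), key=lambda i: score[i], reverse=True)[:top_k]
```

A sentence sharing words with many others accumulates the most importance and is chosen for the abstract; an isolated sentence keeps only the baseline (1 - d).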
It should be noted that the jieba segmentation module is used in word-vector training, together with Google's Word2vec module, and a function for recognizing neologisms was implemented. Because jieba segments new words of three or more Chinese characters poorly (for example 小游戏 "mini-game", 自动驾驶 "automatic driving", 物联网 "Internet of Things", 区块链 "blockchain"), such neologisms are extracted with a function whose basic principle is: after jieba segmentation, take two adjacent words a and b and form c = a + b, e.g. a = 小 and b = 游戏 combine into c = 小游戏; if c occurs more than a certain threshold number of times, c is treated as a neologism and added to the jieba dictionary.
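The merging rule just described can be sketched as follows; the threshold and example tokens are assumptions, and in a real pipeline the discovered word would be registered with `jieba.add_word`:

```python
from collections import Counter

def find_new_words(token_stream, threshold=3):
    """Merge adjacent token pairs (a, b) into a candidate new word c = a + b
    when the pair occurs more than `threshold` times in the segmented stream,
    as the text describes for words jieba splits apart (e.g. 小 + 游戏 -> 小游戏)."""
    pair_counts = Counter(zip(token_stream, token_stream[1:]))
    return {a + b for (a, b), n in pair_counts.items() if n > threshold}

# In production, each discovered word would then be added to the dictionary:
#   for word in find_new_words(tokens): jieba.add_word(word)
```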
Preferably, in the above article recommendation method based on Chinese similarity computation, the specific steps of Step 2 are:
Step 2.1: definition of word vectors. A machine learning task needs every input quantized into a numerical representation; the computing power of the computer is then fully exploited to calculate the desired result. One representation of word vectors is one-hot:
First, count all the words in the corpus and number each word; for each word, build a V-dimensional vector in which each dimension represents one word, the dimension at the word's number is 1, and all other dimensions are 0.
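A minimal sketch of the one-hot construction described above:

```python
def build_one_hot(corpus_tokens):
    """Number each distinct word, then give each word a V-dimensional vector
    with a 1 at its own index and 0 everywhere else."""
    vocab = sorted(set(corpus_tokens))
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    return {w: [1 if j == index[w] else 0 for j in range(V)] for w in vocab}
```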
Step 2.2: methods of obtaining word vectors include methods based on singular value decomposition and methods based on iteration.
Methods based on singular value decomposition:
a. Word-document matrix
Build a matrix of words and documents; the vector representation of a word is obtained by performing singular value decomposition on this matrix.
b. Word-word matrix
Set a context window and build a co-occurrence matrix between words by counting; word vectors are obtained by performing singular value decomposition on this matrix.
Methods based on iteration are formalized as follows (standard n-gram forms, reconstructed to match the surrounding text):
Unigram language model: assume the probability of each word depends only on the word itself, P(w_1, ..., w_n) = Π_{i=1}^{n} P(w_i);
Bigram language model: assume the probability of the current word depends on the previous word, P(w_1, ..., w_n) = P(w_1) · Π_{i=2}^{n} P(w_i | w_{i-1}).
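The bigram model can be sketched with maximum-likelihood counts; the toy corpus and the absence of smoothing are simplifying assumptions:

```python
from collections import Counter

def bigram_probability(tokens, sentence):
    """Bigram model from the text: P(w_1..w_n) = P(w_1) * prod_i P(w_i | w_{i-1}),
    with probabilities estimated by counting in `tokens` (no smoothing)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    if unigrams[sentence[0]] == 0:
        return 0.0
    p = unigrams[sentence[0]] / total  # P(w_1)
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # P(w_i | w_{i-1})
    return p
```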
a. Continuous Bag of Words Model (CBOW)
Given the context, predict the probability distribution of the target word; first set an objective function, then optimize the neural network by gradient descent.
The objective function uses cross entropy (reconstructed in the standard form):

H(ŷ, y) = - Σ_{j=1}^{V} y_j log(ŷ_j)

Since y is a one-hot representation, the term is non-zero only at the target index i, so the objective becomes

E = - log(ŷ_i)

Substituting the softmax prediction ŷ_i = exp(u_i) / Σ_j exp(u_j), the objective converts to

E = - u_i + log Σ_{j=1}^{V} exp(u_j).
b. Skip-Gram Model
The skip-gram model predicts the probability of the context given the target word. An objective function is set, and then an optimization method is used to find the optimal parameters; the objective (reconstructed in the standard form) is

E = - log Π_{c=1}^{C} p(w_{O,c} | w_I) = - Σ_{c=1}^{C} log p(w_{O,c} | w_I)

Computing the probabilities between the hidden layer and all V words of the output layer is expensive, so to optimize training the model is trained with two methods: hierarchical softmax and negative sampling.
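The collapse of the cross-entropy objective under a one-hot target, as derived above, can be checked numerically; the score vector u below is an arbitrary example, not a trained model's output:

```python
import math

def softmax(u):
    """Numerically stable softmax over a list of scores."""
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_hat, y):
    """H(y_hat, y) = -sum_j y_j * log(y_hat_j)."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj > 0)

u = [2.0, 0.5, -1.0]   # output-layer scores for a vocabulary of V = 3 words
y = [1, 0, 0]          # one-hot target, i = 0
loss = cross_entropy(softmax(u), y)
# With a one-hot target, the objective equals -u_i + log(sum_j exp(u_j))
equivalent = -u[0] + math.log(sum(math.exp(x) for x in u))
```

The two quantities agree, confirming the simplification the text performs before substituting the softmax.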
Preferably, in the above article recommendation method based on Chinese similarity computation, in Step 3 the articles to be recommended are converted into word-vector matrices and the matrix data is standardized, so that each article is represented by a group of vectors, and intermediate file data is generated. Keyword extraction is based on the TF-IDF algorithm: term frequency-inverse document frequency, a common weighting technique in information retrieval and text mining, assesses the importance of a word to one document in a document set or corpus.
Term frequency is the number of times a given word occurs in the document:
TF_w = (number of occurrences of term w in the document) / (total number of terms in the document);
Inverse document frequency: the inverse document frequency of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, then taking the logarithm of the quotient:
IDF = log(total number of documents in the corpus / (number of documents containing term w + 1));
TF-IDF = TF × IDF.
The keywords of each article are obtained in this way, these keywords are then vectorized, and finally the vectors are merged to obtain the vector representation of the article.
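The TF-IDF keyword step can be sketched as follows, using the +1 variant of IDF given above; the toy documents are assumptions for illustration:

```python
import math
from collections import Counter

def tf_idf_keywords(documents, doc_index, top_k=3):
    """documents: list of tokenized documents. For each word w in one document:
    TF_w  = occurrences of w in the document / total terms in the document,
    IDF_w = log(N / (number of documents containing w + 1)),  # +1 as in the text
    score = TF * IDF. Returns the top_k highest-scoring words."""
    n_docs = len(documents)
    df = Counter()  # document frequency of each word
    for doc in documents:
        df.update(set(doc))
    doc = documents[doc_index]
    counts = Counter(doc)
    total = len(doc)
    scores = {w: (c / total) * math.log(n_docs / (df[w] + 1)) for w, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Words frequent in one article but rare across the corpus score highest, which is exactly the weighting property the text motivates.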
Preferably, in the above article recommendation method based on Chinese similarity computation, in Step 4 the user's keyword phrase is converted into a group of matrices, the article matrices obtained in Step 3 are read, and the matrix computation is performed, yielding a column of data that is sorted by similarity coefficient. The standardization formula (reconstructed to match the definitions below) is

v' = v / ‖v‖

that is, the standardized vector has modulus 1 and is a unit vector. The cosine of the angle between two n-dimensional sample points a(x_11, x_12, ..., x_1n) and b(x_21, x_22, ..., x_2n) measures the similarity between a and b, that is:

cos(θ) = Σ_{i=1}^{n} x_1i · x_2i / ( sqrt(Σ_{i=1}^{n} x_1i²) · sqrt(Σ_{i=1}^{n} x_2i²) )

The resulting cos(θ) is the similarity coefficient. With A the matrix formed by the vectors of the articles and B the user's keyword vector, the similarity-coefficient vector is C = A · B; the length of C(c_1, c_2, ..., c_n) is the number of articles, where n is the total number of articles to be recommended and c_1 is the similarity coefficient between article number 1 and the user's keywords. Articles are recommended to the user according to the magnitude of the similarity coefficient between the article vector and the keyword vector entered by the user; articles with larger similarity coefficients are recommended first.
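The standardization and cosine ranking of Step 4 can be sketched as follows; the toy vectors are assumptions for illustration:

```python
import math

def normalize(v):
    """Scale v to unit length, so its modulus is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def recommend(article_vectors, keyword_vector, top_k=2):
    """Rank articles by the cosine of the angle between each article vector
    and the user's keyword vector; after normalization the dot product
    itself is cos(theta), the similarity coefficient."""
    q = normalize(keyword_vector)
    scores = []
    for idx, a in enumerate(article_vectors):
        a = normalize(a)
        scores.append((sum(x * y for x, y in zip(a, q)), idx))
    scores.sort(reverse=True)  # larger coefficient -> recommended first
    return [idx for _, idx in scores[:top_k]]
```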
It can be seen from the above technical scheme that, compared with the prior art, the present disclosure provides an article recommendation method based on Chinese similarity computation that helps Internet users efficiently discover articles of interest, with wide applicability, low manual-labeling cost, and good recommendation diversity. First the main content of articles is crawled with a Python crawler; word vectors are then obtained from the crawled content and trained; the articles to be recommended are then converted into word-vector matrices; finally the user's keyword phrase is converted into a group of matrices, the word-vector matrices of the articles are read, and the matrix computation is performed, yielding a column of data that is sorted by similarity coefficient to recommend articles to the user.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 attached drawing is flow chart of the invention;
Fig. 2 attached drawing is CBOW model structure schematic diagram of the invention;
Fig. 3 attached drawing is skip-gram model structure schematic diagram of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to Figs. 1-3. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The embodiment of the invention discloses an article recommendation method based on Chinese similarity computation that helps Internet users efficiently discover articles of interest, with wide applicability, low manual-labeling cost, and good recommendation diversity.
As shown in Fig. 1, an article recommendation method based on Chinese similarity computation comprises the following steps:
Step 1: crawl the main content of articles with a Python crawler;
Step 2: obtain word vectors from the crawled content and train them;
Step 3: convert the articles to be recommended into word-vector matrices;
Step 4: convert the user's keyword phrase into a matrix, read the word-vector matrices obtained in Step 3, standardize the matrix data, perform the matrix computation, and rank the articles by similarity coefficient.
To further optimize the above technical scheme, the main content crawled in Step 1 comprises: the body text, the head image, and the article abstract. The body text is used to generate the word-vector representation of the article; the head image is shown when the article is recommended to the user; the article abstract is obtained by extracting three sentences from the original text with the TextRank algorithm, summarizing the article's main content.
To further optimize the above technical scheme: the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. The entire World Wide Web is a directed graph whose nodes are web pages; if page A contains a link to page B, there is a directed edge from A to B. After the graph is constructed, the following formula is used (reconstructed in the standard PageRank form):

S(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|

where S(V_i) is the importance of page i; d is a damping coefficient, usually set to 0.85; In(V_i) is the set of pages that link to page i; Out(V_j) is the set of pages that the links in page j point to; and |Out(V_j)| is the number of elements in that set. The importance of a page depends on the sum of the importances contributed by the pages linking to it.
To further optimize the above technical scheme, the TextRank algorithm splits the original text into sentences, filters stop words out of each sentence, and keeps only words of specified parts of speech, obtaining a set of sentences and a set of words. Each word is a node as in PageRank. With window size k, assume a sentence consists of the words [w_1, w_2, ..., w_n]; then [w_1, ..., w_k], [w_2, ..., w_{k+1}], [w_3, ..., w_{k+2}], and so on are each a window. Between the nodes of any two words in the same window there is an unweighted, undirected edge. On the graph so constructed, the importance of each word node is computed, and the most important words are taken as keywords. TextRank can also extract key phrases: if several keywords are adjacent in the original text, they form a key phrase. To extract an abstract with TextRank, each sentence is a node in the graph; if two sentences are similar, there is a weighted undirected edge between the corresponding nodes, with the similarity as the weight. The sentences with the highest importance computed by the PageRank algorithm form the abstract. The similarity of two sentences S_i and S_j is computed with the following formula (reconstructed in the standard TextRank form):

Similarity(S_i, S_j) = |{w_k | w_k ∈ S_i and w_k ∈ S_j}| / (log|S_i| + log|S_j|)

where the numerator is the number of words that occur in both sentences and |S_i| is the number of words in sentence i.
Since the sentence graph is weighted, the PageRank formula is modified to:

WS(V_i) = (1 - d) + d · Σ_{V_j ∈ In(V_i)} [ w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk} ] · WS(V_j)

When computing keywords, all edge weights are equal, so the weights w in numerator and denominator cancel and the TextRank algorithm degenerates to PageRank. The TextRank algorithm inside the textrank4zh module can extract three sentences from the original text to briefly summarize the article.
To further optimize the above technical scheme, the specific steps of Step 2 are:
Step 2.1: definition of word vectors. A machine learning task needs every input quantized into a numerical representation; the computing power of the computer is then fully exploited to calculate the desired result. One representation of word vectors is one-hot:
First, count all the words in the corpus and number each word; for each word, build a V-dimensional vector in which each dimension represents one word, the dimension at the word's number is 1, and all other dimensions are 0.
Step 2.2: methods of obtaining word vectors include methods based on singular value decomposition and methods based on iteration.
Methods based on singular value decomposition:
a. Word-document matrix
Build a matrix of words and documents; the vector representation of a word is obtained by performing singular value decomposition on this matrix.
b. Word-word matrix
Set a context window and build a co-occurrence matrix between words by counting; word vectors are obtained by performing singular value decomposition on this matrix.
Methods based on iteration are formalized as follows: a unigram language model assumes the probability of each word depends only on the word itself, P(w_1, ..., w_n) = Π_{i=1}^{n} P(w_i); a bigram language model assumes the probability of the current word depends on the previous word, P(w_1, ..., w_n) = P(w_1) · Π_{i=2}^{n} P(w_i | w_{i-1}).
a. Continuous Bag of Words Model (CBOW)
As shown in Fig. 2, given the context, predict the probability distribution of the target word; first set an objective function, then optimize the neural network by gradient descent.
The objective function uses cross entropy: H(ŷ, y) = -Σ_j y_j log(ŷ_j). Since y is a one-hot representation, the term is non-zero only at the target index i, so the objective becomes E = -log(ŷ_i); substituting the softmax prediction, it converts to E = -u_i + log Σ_j exp(u_j).
b. Skip-Gram Model
As shown in Fig. 3, the skip-gram model predicts the probability of the context given the target word. An objective function is set and an optimization method is used to find the optimal parameters: E = -Σ_{c=1}^{C} log p(w_{O,c} | w_I). Computing the probabilities between the hidden layer and all V words of the output layer is expensive, so to reduce training time the model is trained with two methods: hierarchical softmax and negative sampling.
In order to further optimize the above technical scheme, term vector matrix is converted by article to be recommended in step 3, and right
Matrix data is standardized, and an article is indicated with one group of vector, and generate intermediate file data, and TF-IDF is based on
Algorithm carries out keyword extraction:
The common weighting technique that word frequency-inverse file frequency is prospected for information retrieval and information;A words is assessed for one
The significance level of a file set or a copy of it file in a corpus;
Word frequency is the number that some given word occurs in this document: TFw=entry w in certain one kind occurs
Number/such in all entry number;Some general words for theme there is no too big effect, though it is some go out
The less word of existing frequency can express the theme of article, so simple use is TF inappropriate.The design of weight must expire
Foot: the ability of a word prediction theme is stronger, and weight is bigger, conversely, weight is smaller.In the article of all statistics, some words are only
It is to occur in wherein seldom several articles, then such word is very big to the effect of the theme of article, the weight of these words is answered
The design it is larger.
Reverse document-frequency: the reverse document-frequency of a certain particular words, by general act number divided by comprising the word it
The number of file, then take logarithm to obtain formula obtained quotient:
IDF=log (total number of documents of the corpus/number of files+1 comprising entry w);Why denominator will add 1, be in order to
Avoiding denominator is the low document-frequency of high term frequencies and the word in entire file set in 0 a certain specific file,
It can produce out the TF-IDF of high weight.
TF-IDF=TF*IDF;
The keywords of each article are obtained, these keywords are vectorized, and the resulting vectors are finally merged to obtain the vector representation of the article.
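The TF-IDF keyword step described above can be sketched in Python (a minimal illustration with hypothetical names such as `tf_idf_keywords`; it assumes documents are already tokenized into word lists and uses the IDF formula above with the +1 in the denominator):

```python
import math
from collections import Counter

def tf_idf_keywords(docs, top_k=3):
    """Rank the words of each tokenized document by TF-IDF.

    TF  = count of word w in the doc / total words in the doc
    IDF = log(total docs / (docs containing w + 1))  # +1 avoids a zero denominator
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores = {w: (c / total) * math.log(n_docs / (df[w] + 1))
                  for w, c in tf.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords.append(ranked[:top_k])
    return keywords

docs = [["word", "vector", "article", "article"],
        ["article", "matrix", "similarity"],
        ["vector", "matrix", "matrix", "keyword"]]
print(tf_idf_keywords(docs, top_k=2))
```

Note that with the +1 denominator a word occurring in every document gets IDF = log(N/(N+1)) < 0, so ubiquitous words sink to the bottom of each ranking.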
To further optimize the above technical scheme, in step 4 the user's keyword phrase is converted into a group of matrices, the article matrix obtained in step 3 is read, and a matrix calculation is performed to obtain a column of data, which is sorted by similarity coefficient; the standardization formula is as follows:
Each n-dimensional vector is standardized so that its modulus is 1, i.e. it becomes a unit vector; the cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n) measures the similarity between a and b, that is:
cos(θ) = (Σk x1k·x2k) / (√(Σk x1k²) · √(Σk x2k²))
The resulting cos(θ) is the similarity coefficient. Let A be the matrix composed of the article vectors and B the user's keyword vector; the similarity coefficient vector is C = A·B, where the length of C(c1, c2, …, cn) equals the number of articles, n is the total number of articles to be recommended, and c1 is the similarity coefficient between article number 1 and the user's keywords. Articles are recommended to the user according to the size of the similarity coefficient between the article vectors and the keyword vector the user entered; the larger the similarity coefficient, the higher the article's recommendation priority.
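The standardization and cosine-ranking step can be sketched as follows (a minimal illustration with hypothetical helper names; the toy 3-dimensional vectors stand in for real article and keyword vectors):

```python
import math

def normalize(v):
    """Scale v to unit length (modulus 1), per the standardization step."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """cos(theta) of the angle between a and b, i.e. the similarity coefficient."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def recommend(article_vectors, keyword_vector, top_n=2):
    """Rank articles by similarity to the user's keyword vector."""
    scored = [(i, cosine(v, keyword_vector)) for i, v in enumerate(article_vectors)]
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:top_n]

articles = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 1.0, 0.0]
print(recommend(articles, query))  # article 1 ranks first, then article 0
```

Because every vector is normalized first, the dot product equals the cosine of the included angle, so sorting by it orders the articles by similarity coefficient.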
Each embodiment in this specification is described in a progressive manner; each embodiment highlights its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively simple, and the relevant points can be found in the description of the method.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. An article recommendation method based on Chinese similarity measures, characterized in that the specific steps include:
Step 1: crawl the main contents of articles using a Python crawler;
Step 2: obtain word vectors from the crawled article contents and train them;
Step 3: convert the articles to be recommended into a word-vector matrix;
Step 4: convert the user's keyword phrase into a matrix, then read the word-vector matrix obtained from the article conversion in step 3, standardize the word-vector matrix data, perform the matrix calculation, and sort by similarity coefficient.
2. The article recommendation method based on Chinese similarity measures according to claim 1, characterized in that the main contents crawled in step 1 specifically include: the text content, the head image, and the article abstract; the text content is used to generate the word-vector representation of the article; the head image is displayed with the article recommended to the user; the article abstract uses the TextRank algorithm to extract three sentences from the original text, so as to summarize the main contents of the article.
3. The article recommendation method based on Chinese similarity measures according to claim 2, characterized in that the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. The entire World Wide Web is treated as a directed graph whose nodes are webpages; if webpage A contains a link to webpage B, there is a directed edge from webpage A to webpage B. After the graph is constructed, the following formula is used:
S(Vi) = (1 - d) + d · Σ over Vj in In(Vi) of S(Vj) / |Out(Vj)|
where S(Vi) is the importance of webpage i; d is the damping coefficient, usually set to 0.85; In(Vi) is the set of webpages with links pointing to webpage i; Out(Vj) is the set of webpages pointed to by the links in webpage j; and |Out(Vj)| is the number of elements in that set. The importance of a webpage depends on the sum of the importance contributed by each webpage linking to it.
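The webpage-importance formula above can be illustrated with a short iterative sketch (hypothetical `pagerank` helper; a tiny three-page web stands in for the WWW graph):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate S(Vi) = (1 - d) + d * sum over Vj in In(Vi) of S(Vj)/|Out(Vj)|.

    links maps each page to the list of pages it links to.
    """
    nodes = set(links)
    for outs in links.values():
        nodes.update(outs)
    score = {n: 1.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 1.0 - d for n in nodes}       # the (1 - d) base term
        for src, outs in links.items():
            if not outs:
                continue
            share = score[src] / len(outs)      # S(Vj) / |Out(Vj)|
            for dst in outs:
                nxt[dst] += d * share
        score = nxt
    return score

# A toy web: A and C both link to B, B links back to A.
web = {"A": ["B"], "B": ["A"], "C": ["B"]}
scores = pagerank(web)
print(max(scores, key=scores.get))  # B collects the most incoming importance
```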
4. The article recommendation method based on Chinese similarity measures according to claim 3, characterized in that the TextRank algorithm splits the original text into sentences, filters out stop words in each sentence, and retains only words of specified parts of speech, yielding a set of sentences and a set of words. Each word is a node, as in PageRank. Set the window size to k; assuming a sentence consists of the words w1, w2, …, then [w1, w2, …, wk], [w2, w3, …, wk+1], [w3, w4, …, wk+2], and so on are each a window. An unweighted, undirected edge exists between the nodes corresponding to any two words in the same window. Based on the graph constructed above, the importance of each word node is calculated, and the most important words are taken as keywords. TextRank can also extract key phrases: if several keywords are adjacent in the original text, those keywords constitute a key phrase.
To extract an abstract with TextRank, each sentence is regarded as a node in the graph; if two sentences are similar, a weighted undirected edge is considered to exist between the two corresponding nodes, whose weight is their similarity. The sentences with the highest importance calculated by the PageRank algorithm are taken as the abstract. The similarity of two sentences Si and Sj is calculated using the following formula:
Similarity(Si, Sj) = |{wk | wk ∈ Si and wk ∈ Sj}| / (log|Si| + log|Sj|)
where |{wk | wk ∈ Si and wk ∈ Sj}| is the number of words that appear in both sentences and |Si| is the number of words in sentence i.
Since the graph is weighted, the PageRank formula is modified as:
WS(Vi) = (1 - d) + d · Σ over Vj in In(Vi) of [w(j,i) / Σ over Vk in Out(Vj) of w(j,k)] · WS(Vj)
When calculating keywords, a single word is regarded as a sentence; the weights of all the resulting edges are then equal, the weights w in the numerator and denominator cancel, and the TextRank algorithm degenerates into PageRank. Using the TextRank algorithm inside the textrank4zh module, three sentences can be extracted from the original text to briefly summarize the article.
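The sentence-graph variant can be sketched as follows (illustrative code, not the textrank4zh implementation; it combines the sentence-similarity formula with the weighted PageRank update, on toy tokenized sentences):

```python
import math

def sentence_similarity(si, sj):
    """|{w : w in Si and w in Sj}| / (log|Si| + log|Sj|)."""
    overlap = len(set(si) & set(sj))
    denom = math.log(len(si)) + math.log(len(sj))
    return overlap / denom if denom > 0 else 0.0

def textrank_summary(sentences, d=0.85, iters=50, top_n=1):
    """Score sentences on the weighted graph and keep the highest-ranked ones."""
    n = len(sentences)
    w = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    score = [1.0] * n
    for _ in range(iters):
        out_sum = [sum(w[j]) for j in range(n)]
        score = [(1 - d) + d * sum(w[j][i] * score[j] / out_sum[j]
                                   for j in range(n) if out_sum[j] > 0)
                 for i in range(n)]
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    return ranked[:top_n]

sents = [["matrix", "vector", "article"],
         ["matrix", "vector", "keyword"],
         ["weather", "report", "today"]]
print(textrank_summary(sents))
```

The off-topic third sentence shares no words with the others, so its score stays at the (1 - d) floor and one of the two related sentences is selected as the summary.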
5. The article recommendation method based on Chinese similarity measures according to claim 1, characterized in that the specific steps of step 2 include:
Step 2.1: definition of word vectors. A machine-learning task requires any input to be quantized into a numerical representation so that the computing power of the computer can be fully used to calculate the desired result; one representation of word vectors is the one-hot representation:
First, count all the vocabulary in the corpus and number each word; then establish a V-dimensional vector for each word, where each dimension of the vector represents a word: the value at the dimension matching the word's number is 1, and all other dimensions are 0.
Step 2.2: methods of obtaining word vectors include methods based on singular value decomposition and methods based on iteration. The methods based on singular value decomposition:
a. word-document matrix: establish a matrix of words and documents; by performing singular value decomposition on this matrix, the vector representation of each word is obtained;
b. word-word matrix: set a context window and count word co-occurrences to establish a word-word co-occurrence matrix; word vectors are obtained by performing singular value decomposition on this matrix.
The methods based on iteration have the following formalization:
a unigram language model assumes that the probability of each word depends only on the word itself: P(w1, w2, …, wn) = Π P(wi);
a bigram language model assumes that the probability of the current word depends on the previous word: P(w1, w2, …, wn) = Π P(wi | wi-1);
a. Continuous Bag of Words Model
Given the context, predict the probability distribution of the target word: first set an objective function, then optimize the neural network by the gradient descent method.
The objective function uses the cross-entropy function:
H(ŷ, y) = -Σj yj · log(ŷj)
Since y is a one-hot representation, only the term for the index i with yi = 1 is nonzero, and the objective function becomes:
H(ŷ, y) = -log(ŷi)
Substituting the formula for the predicted value ŷi (the softmax output), the objective function can be converted into:
H = -ui · v̂ + log Σj exp(uj · v̂)
where v̂ is the context vector and ui is the output vector of the target word.
b. Skip-Gram Model
The skip-gram model predicts the probability of the context given the target word: set an objective function, then use an optimization method to find the optimal parameter solution. The objective function is as follows:
J = -Σ over -m ≤ j ≤ m, j ≠ 0 of log P(w(c+j) | w(c))
where m is the window size and w(c) is the center word. Since the probabilities between the hidden layer and all V words of the output layer must be calculated, the two methods Hierarchical softmax and Negative sampling are used to optimize the training of the model.
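The one-hot representation of step 2.1 and the word-word co-occurrence matrix of step 2.2 can be sketched as follows (illustrative helpers; a real system would factorize the co-occurrence matrix with SVD, or train word2vec, rather than use the raw counts):

```python
def build_vocab(corpus):
    """Number every distinct word, as in the one-hot representation step."""
    vocab = {}
    for sent in corpus:
        for w in sent:
            vocab.setdefault(w, len(vocab))
    return vocab

def one_hot(word, vocab):
    """V-dimensional vector: 1 at the word's numbered position, 0 elsewhere."""
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v

def cooccurrence(corpus, vocab, window=2):
    """Word-word co-occurrence counts within a context window (the matrix
    that the SVD-based method would factorize to obtain word vectors)."""
    V = len(vocab)
    m = [[0] * V for _ in range(V)]
    for sent in corpus:
        for i, w in enumerate(sent):
            for c in sent[max(0, i - window): i + window + 1]:
                if c != w:
                    m[vocab[w]][vocab[c]] += 1
    return m

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus)
print(one_hot("cat", vocab))
print(cooccurrence(corpus, vocab)[vocab["the"]])
```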
6. The article recommendation method based on Chinese similarity measures according to claim 1, characterized in that in step 3 the articles to be recommended are converted into a word-vector matrix and the matrix data is standardized, each article is represented by a group of vectors, an intermediate data file is generated, and keyword extraction is performed based on the TF-IDF algorithm:
TF-IDF (term frequency-inverse document frequency) is a common weighting technique in information retrieval and text mining; it assesses the importance of a word to a document within a file set or corpus.
Term frequency (TF) is the number of times a given word appears in a document: TFw = (number of occurrences of term w in the document) / (total number of terms in the document).
Inverse document frequency (IDF): the inverse document frequency of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and then taking the logarithm of the quotient:
IDF = log(total number of documents in the corpus / (number of documents containing term w + 1));
TF-IDF = TF * IDF;
The keywords of each article are obtained, these keywords are vectorized, and the resulting vectors are finally merged to obtain the vector representation of the article.
7. The article recommendation method based on Chinese similarity measures according to claim 1, characterized in that in step 4 the user's keyword phrase is converted into a group of matrices, the article matrix obtained in step 3 is read, and a matrix calculation is performed to obtain a column of data, which is sorted by similarity coefficient; the standardization formula is as follows:
Each n-dimensional vector is standardized so that its modulus is 1, i.e. it becomes a unit vector; the cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n) measures the similarity between a and b, that is:
cos(θ) = (Σk x1k·x2k) / (√(Σk x1k²) · √(Σk x2k²))
The resulting cos(θ) is the similarity coefficient. Let A be the matrix composed of the article vectors and B the user's keyword vector; the similarity coefficient vector is C = A·B, where the length of C(c1, c2, …, cn) equals the number of articles, n is the total number of articles to be recommended, and c1 is the similarity coefficient between article number 1 and the user's keywords. Articles are recommended to the user according to the size of the similarity coefficient between the article vectors and the keyword vector the user entered; the larger the similarity coefficient, the higher the article's recommendation priority.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810701560.8A CN110020189A (en) | 2018-06-29 | 2018-06-29 | A kind of article recommended method based on Chinese Similarity measures |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810701560.8A CN110020189A (en) | 2018-06-29 | 2018-06-29 | A kind of article recommended method based on Chinese Similarity measures |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110020189A true CN110020189A (en) | 2019-07-16 |
Family
ID=67188323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810701560.8A Pending CN110020189A (en) | 2018-06-29 | 2018-06-29 | A kind of article recommended method based on Chinese Similarity measures |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020189A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110597981A (en) * | 2019-09-16 | 2019-12-20 | 西华大学 | Network news summary system for automatically generating summary by adopting multiple strategies |
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN110633363A (en) * | 2019-09-18 | 2019-12-31 | 桂林电子科技大学 | Text entity recommendation method based on NLP and fuzzy multi-criterion decision |
CN110851570A (en) * | 2019-11-14 | 2020-02-28 | 中山大学 | Unsupervised keyword extraction method based on Embedding technology |
CN111061957A (en) * | 2019-12-26 | 2020-04-24 | 广东电网有限责任公司 | Article similarity recommendation method and device |
CN111178059A (en) * | 2019-12-07 | 2020-05-19 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN111651588A (en) * | 2020-06-10 | 2020-09-11 | 扬州大学 | Article abstract information extraction algorithm based on directed graph |
CN111753151A (en) * | 2020-06-24 | 2020-10-09 | 广东科杰通信息科技有限公司 | Service recommendation method based on internet user behaviors |
CN112000867A (en) * | 2020-08-17 | 2020-11-27 | 桂林电子科技大学 | Text classification method based on social media platform |
CN112686026A (en) * | 2021-03-17 | 2021-04-20 | 平安科技(深圳)有限公司 | Keyword extraction method, device, equipment and medium based on information entropy |
TWI727624B (en) * | 2020-01-21 | 2021-05-11 | 兆豐國際商業銀行股份有限公司 | News filtering device and news filtering method |
CN112948568A (en) * | 2019-12-10 | 2021-06-11 | 武汉渔见晚科技有限责任公司 | Content recommendation method and device based on text concept network |
CN112949287A (en) * | 2021-01-13 | 2021-06-11 | 平安科技(深圳)有限公司 | Hot word mining method, system, computer device and storage medium |
CN113554053A (en) * | 2021-05-20 | 2021-10-26 | 重庆康洲大数据有限公司 | Method for comparing similarity of traditional Chinese medicine prescriptions |
CN113742602A (en) * | 2020-05-29 | 2021-12-03 | 中国电信股份有限公司 | Method, apparatus, and computer-readable storage medium for sample optimization |
CN113761323A (en) * | 2020-06-01 | 2021-12-07 | 深圳华大基因科技有限公司 | Document recommendation system and document recommendation method |
TWI749901B (en) * | 2020-11-25 | 2021-12-11 | 重量科技股份有限公司 | Method for forming key information and computer system for the same |
CN114254851A (en) * | 2020-09-24 | 2022-03-29 | Ncr公司 | Commodity similarity handling |
CN114943224A (en) * | 2022-05-07 | 2022-08-26 | 新智道枢(上海)科技有限公司 | Word vector-based alert text keyword extraction method, system, medium, and device |
CN117610543A (en) * | 2023-11-08 | 2024-02-27 | 华南理工大学 | Chinese character and structure association analysis method, medium and equipment based on graph network |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008243024A (en) * | 2007-03-28 | 2008-10-09 | Kyushu Institute Of Technology | Information acquisition device, program therefor and method |
CN101980196A (en) * | 2010-10-25 | 2011-02-23 | 中国农业大学 | Article comparison method and device |
CN103927358A (en) * | 2014-04-15 | 2014-07-16 | 清华大学 | Text search method and system |
CN105183833A (en) * | 2015-08-31 | 2015-12-23 | 天津大学 | User model based microblogging text recommendation method and recommendation apparatus thereof |
CN106227722A (en) * | 2016-09-12 | 2016-12-14 | 中山大学 | A kind of extraction method based on listed company's bulletin summary |
CN106484664A (en) * | 2016-10-21 | 2017-03-08 | 竹间智能科技(上海)有限公司 | Similarity calculating method between a kind of short text |
US20170139899A1 (en) * | 2015-11-18 | 2017-05-18 | Le Holdings (Beijing) Co., Ltd. | Keyword extraction method and electronic device |
CN106815297A (en) * | 2016-12-09 | 2017-06-09 | 宁波大学 | A kind of academic resources recommendation service system and method |
CN107133315A (en) * | 2017-05-03 | 2017-09-05 | 有米科技股份有限公司 | A kind of smart media based on semantic analysis recommends method |
CN107247780A (en) * | 2017-06-12 | 2017-10-13 | 北京理工大学 | A kind of patent document method for measuring similarity of knowledge based body |
CN107832306A (en) * | 2017-11-28 | 2018-03-23 | 武汉大学 | A kind of similar entities method for digging based on Doc2vec |
2018-06-29: CN CN201810701560.8A patent/CN110020189A/en active Pending
Non-Patent Citations (2)
Title |
---|
芮伟康: "基于语义的文本向量表示方法研究", 《中国优秀硕士论文全文数据库_信息科技辑》, 15 January 2018 (2018-01-15), pages 7 - 35 *
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110597949A (en) * | 2019-08-01 | 2019-12-20 | 湖北工业大学 | Court similar case recommendation model based on word vectors and word frequency |
CN110597981A (en) * | 2019-09-16 | 2019-12-20 | 西华大学 | Network news summary system for automatically generating summary by adopting multiple strategies |
CN110633363A (en) * | 2019-09-18 | 2019-12-31 | 桂林电子科技大学 | Text entity recommendation method based on NLP and fuzzy multi-criterion decision |
CN110851570B (en) * | 2019-11-14 | 2023-04-18 | 中山大学 | Unsupervised keyword extraction method based on Embedding technology |
CN110851570A (en) * | 2019-11-14 | 2020-02-28 | 中山大学 | Unsupervised keyword extraction method based on Embedding technology |
CN111178059A (en) * | 2019-12-07 | 2020-05-19 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN111178059B (en) * | 2019-12-07 | 2023-08-25 | 武汉光谷信息技术股份有限公司 | Similarity comparison method and device based on word2vec technology |
CN112948568B (en) * | 2019-12-10 | 2022-08-30 | 武汉渔见晚科技有限责任公司 | Content recommendation method and device based on text concept network |
CN112948568A (en) * | 2019-12-10 | 2021-06-11 | 武汉渔见晚科技有限责任公司 | Content recommendation method and device based on text concept network |
CN111061957A (en) * | 2019-12-26 | 2020-04-24 | 广东电网有限责任公司 | Article similarity recommendation method and device |
TWI727624B (en) * | 2020-01-21 | 2021-05-11 | 兆豐國際商業銀行股份有限公司 | News filtering device and news filtering method |
CN113742602A (en) * | 2020-05-29 | 2021-12-03 | 中国电信股份有限公司 | Method, apparatus, and computer-readable storage medium for sample optimization |
CN113761323A (en) * | 2020-06-01 | 2021-12-07 | 深圳华大基因科技有限公司 | Document recommendation system and document recommendation method |
CN111651588B (en) * | 2020-06-10 | 2024-03-05 | 扬州大学 | Article abstract information extraction algorithm based on directed graph |
CN111651588A (en) * | 2020-06-10 | 2020-09-11 | 扬州大学 | Article abstract information extraction algorithm based on directed graph |
CN111753151B (en) * | 2020-06-24 | 2023-09-15 | 广东科杰通信息科技有限公司 | Service recommendation method based on Internet user behavior |
CN111753151A (en) * | 2020-06-24 | 2020-10-09 | 广东科杰通信息科技有限公司 | Service recommendation method based on internet user behaviors |
CN112000867A (en) * | 2020-08-17 | 2020-11-27 | 桂林电子科技大学 | Text classification method based on social media platform |
CN114254851A (en) * | 2020-09-24 | 2022-03-29 | Ncr公司 | Commodity similarity handling |
TWI749901B (en) * | 2020-11-25 | 2021-12-11 | 重量科技股份有限公司 | Method for forming key information and computer system for the same |
CN112949287A (en) * | 2021-01-13 | 2021-06-11 | 平安科技(深圳)有限公司 | Hot word mining method, system, computer device and storage medium |
CN112949287B (en) * | 2021-01-13 | 2023-06-27 | 平安科技(深圳)有限公司 | Hot word mining method, system, computer equipment and storage medium |
CN112686026A (en) * | 2021-03-17 | 2021-04-20 | 平安科技(深圳)有限公司 | Keyword extraction method, device, equipment and medium based on information entropy |
CN113554053A (en) * | 2021-05-20 | 2021-10-26 | 重庆康洲大数据有限公司 | Method for comparing similarity of traditional Chinese medicine prescriptions |
CN114943224A (en) * | 2022-05-07 | 2022-08-26 | 新智道枢(上海)科技有限公司 | Word vector-based alert text keyword extraction method, system, medium, and device |
CN117610543A (en) * | 2023-11-08 | 2024-02-27 | 华南理工大学 | Chinese character and structure association analysis method, medium and equipment based on graph network |
CN117610543B (en) * | 2023-11-08 | 2024-08-02 | 华南理工大学 | Chinese character and structure association analysis method, medium and equipment based on graph network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110020189A (en) | A kind of article recommended method based on Chinese Similarity measures | |
Wang et al. | Multilayer dense attention model for image caption | |
CN106844658B (en) | Automatic construction method and system of Chinese text knowledge graph | |
Thakkar et al. | Graph-based algorithms for text summarization | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
CN108052593A (en) | A kind of subject key words extracting method based on descriptor vector sum network structure | |
CN109960786A (en) | Chinese Measurement of word similarity based on convergence strategy | |
CN103207860B (en) | The entity relation extraction method and apparatus of public sentiment event | |
CN107122413A (en) | A kind of keyword extracting method and device based on graph model | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN107247780A (en) | A kind of patent document method for measuring similarity of knowledge based body | |
CN110674252A (en) | High-precision semantic search system for judicial domain | |
CN110879834B (en) | Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof | |
CN102622338A (en) | Computer-assisted computing method of semantic distance between short texts | |
CN106484797A (en) | Accident summary abstracting method based on sparse study | |
CN103473280A (en) | Method and device for mining comparable network language materials | |
CN112256861B (en) | Rumor detection method based on search engine return result and electronic device | |
CN114997288B (en) | Design resource association method | |
CN110888991A (en) | Sectional semantic annotation method in weak annotation environment | |
Sadr et al. | Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms | |
CN109635107A (en) | The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source | |
CN114912449B (en) | Technical feature keyword extraction method and system based on code description text | |
CN112417170B (en) | Relationship linking method for incomplete knowledge graph | |
CN111984782A (en) | Method and system for generating text abstract of Tibetan language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190716 |