
CN110020189A - A kind of article recommended method based on Chinese Similarity measures - Google Patents

A kind of article recommended method based on Chinese Similarity measures Download PDF

Info

Publication number
CN110020189A
Authority
CN
China
Prior art keywords
article
word
matrix
vector
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810701560.8A
Other languages
Chinese (zh)
Inventor
孙铭鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhangyou Technology Co Ltd
Original Assignee
Wuhan Zhangyou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhangyou Technology Co Ltd filed Critical Wuhan Zhangyou Technology Co Ltd
Priority to CN201810701560.8A priority Critical patent/CN110020189A/en
Publication of CN110020189A publication Critical patent/CN110020189A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an article recommendation method based on Chinese similarity calculation. The specific steps include: crawling the main content of articles with a Python crawler; obtaining word vectors from the crawled article content and training them; converting the articles to be recommended into word-vector matrices; converting the user's keyword phrase into a matrix, reading the word-vector matrices obtained in the previous step, standardizing the word-vector matrix data, performing the matrix calculation, and ranking by similarity coefficient. The invention provides an article recommendation method based on Chinese similarity calculation that can help Internet users efficiently discover articles of interest, with a wide scope of application, a low manual-labeling cost and good recommendation diversity.

Description

A kind of article recommended method based on Chinese Similarity measures
Technical field
The present invention relates to the field of Internet technology, and more particularly to an article recommendation method based on Chinese similarity calculation.
Background technique
With the continuous development of the Internet, people's living habits and lifestyles are undergoing revolutionary changes. The development of the Internet not only makes people's lives more convenient but also greatly increases the channels through which people obtain information. The China Internet Network Information Center (CNNIC) reported in the "36th Statistical Report on Internet Development in China" that, by June 2015, the number of Internet news users in China had reached 555 million, of whom 460 million read news on mobile phones; as an important information-acquisition application, Internet news has a usage rate second only to instant messaging, ranking second.
Against the social background of big data, search engines represented by Google and Baidu allow users to find exactly the information they need by entering keywords. However, if a user cannot accurately describe the keywords that express their needs, a search engine cannot help. Unlike a search engine, a recommender system analyzes a user's behavior or the features of items in order to find content the user is interested in. With the development and growth of major news and article publishing platforms (such as WeChat official accounts), the number of articles grows rapidly and it becomes ever harder for users to find articles of interest; the massive volume of articles brings users a wealth of information but also great difficulty in choosing. How to help users efficiently discover articles of interest has become a major problem that information publishing platforms urgently need to solve.
Owing to the lack of sufficient information about user interests and the challenges of processing articles, automatic article recommendation on the Internet is of limited effectiveness, and similar-article recommendation algorithms still have large room for improvement. An article recommendation algorithm needs natural language processing techniques to cope with difficulties such as semantic ambiguity, syntactic ambiguity, non-standard grammar and inconsistent wording in natural language, and must convert natural language into mathematical symbols that a machine can recognize, building and verifying models by means of machine learning and data mining. At present there is a large body of research on similar-article recommendation, such as article recommendation based on clustering and classification, article recommendation based on keywords, and recommendation of popular articles in specific domains. Although such research achieves a certain effect in some application scenarios, problems such as high complexity, a narrow scope of application, high manual-labeling cost and poor recommendation diversity limit the application of article recommendation algorithms.
Therefore, providing an article recommendation method based on Chinese similarity calculation that can help Internet users efficiently discover articles of interest, with a wide scope of application, a low manual-labeling cost and good recommendation diversity, is a problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of this, the present invention provides an article recommendation method based on Chinese similarity calculation that can help Internet users efficiently discover articles of interest, with a wide scope of application, a low manual-labeling cost and good recommendation diversity.
To achieve the above objects, the present invention provides the following technical solution:
An article recommendation method based on Chinese similarity calculation, whose specific steps include:
Step 1: crawl the main content of articles with a Python crawler;
Step 2: obtain word vectors from the crawled article content and train them;
Step 3: convert the articles to be recommended into word-vector matrices;
Step 4: convert the user's keyword phrase into a matrix, read the word-vector matrices of the articles obtained in step 3, standardize the word-vector matrix data, perform the matrix calculation, and rank by similarity coefficient.
Through the above technical solution, the technical effect of the invention is: according to the user's points of interest, the most relevant articles are recommended; the algorithm implemented is mainly a Chinese similarity calculation, which can help Internet users efficiently discover articles of interest, with a wide scope of application, a low manual-labeling cost and good recommendation diversity.
Preferably, in the above article recommendation method based on Chinese similarity calculation, the main content of an article crawled in step 1 specifically includes: the text content, the head image and the article abstract. The text content is used to generate the word-vector representation of the article; the head image is displayed with articles recommended to the user; the article abstract is obtained extractively with the TextRank algorithm as three sentences of the original text, in order to summarize the main content of the article.
Further, the Python modules mainly used are requests, BeautifulSoup and TextRank4Sentence; the article crawler mainly obtains the content described above.
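As a minimal sketch of this crawling step (the URL handling and tag selectors below are assumptions that depend on the target site), the three pieces of content could be gathered as follows:

    # Minimal crawling sketch; the selectors are hypothetical examples.
    import requests
    from bs4 import BeautifulSoup
    from textrank4zh import TextRank4Sentence

    def crawl_article(url):
        resp = requests.get(url, timeout=10)
        resp.encoding = resp.apparent_encoding
        soup = BeautifulSoup(resp.text, "html.parser")

        # Text content: all paragraph text, used later to build the word-vector representation.
        text = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

        # Head image: first image of the page, shown when the article is recommended.
        img = soup.find("img")
        head_image = img["src"] if img and img.has_attr("src") else None

        # Abstract: the three highest-ranked sentences extracted by TextRank.
        tr = TextRank4Sentence()
        tr.analyze(text=text, lower=True, source='all_filters')
        abstract = [item.sentence for item in tr.get_key_sentences(num=3)]

        return {"text": text, "head_image": head_image, "abstract": abstract}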
Preferably, in the above article recommendation method based on Chinese similarity calculation, the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. The entire World Wide Web can be viewed as a directed graph whose nodes are webpages; if webpage A contains a link to webpage B, there is a directed edge pointing from A to B. After the graph has been constructed, the following formula is used:
S(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance (PR value) of webpage i; d is the damping coefficient, usually set to 0.85; In(Vi) is the set of webpages that contain a link pointing to webpage i; Out(Vj) is the set of webpages pointed to by the links in webpage j; and |Out(Vj)| is the number of elements in that set. The importance of a webpage depends on the sum of the importance of the webpages linking to it.
Through the above technical solution, the beneficial effect is: the importance S(Vj) contributed by each linking webpage must be shared among all the pages that webpage links out to, and is therefore divided by |Out(Vj)|; at the same time, the importance of a page cannot be determined only by the pages linking to it, so a certain probability independent of other pages is reserved, which is the role of d. PageRank must iterate the above formula many times to obtain the result; initially, the importance of every webpage can be set to 1. The left-hand side of the formula is the PR value of webpage i after an iteration, while the PR values used on the right-hand side are those from before the iteration.
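For illustration, a small sketch of this iteration on a toy link graph (the graph itself is a made-up example) might look like:

    # Iterative PageRank sketch on a toy directed graph: node -> set of outgoing links.
    graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}, "D": {"C"}}

    d = 0.85                                  # damping coefficient
    nodes = list(graph)
    incoming = {v: [u for u in graph if v in graph[u]] for v in nodes}
    score = {v: 1.0 for v in nodes}           # initial importance of every webpage is 1

    for _ in range(30):                       # iterate until roughly converged
        score = {
            v: (1 - d) + d * sum(score[u] / len(graph[u]) for u in incoming[v])
            for v in nodes
        }

    print(sorted(score.items(), key=lambda kv: -kv[1]))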
Preferably, in the above article recommendation method based on Chinese similarity calculation, the TextRank algorithm splits the original text into sentences, filters out the stop words in each sentence, and keeps only words of specified parts of speech, yielding a set of sentences and a set of words. Each word is a node in PageRank. The window size is set to k; assuming a sentence consists of the words w1, w2, w3, …, then [w1, w2, …, wk], [w2, w3, …, wk+1], [w3, w4, …, wk+2] and so on are windows, and an unweighted undirected edge exists between the nodes of any two words that appear in the same window. Based on the graph constructed above, the importance of each word node is calculated, and the most important words are taken as keywords. TextRank is also used to extract key phrases: if several keywords are adjacent in the original text, those keywords form a key phrase. For example, in an article introducing "support vector machines", the three keywords "support", "vector" and "machine" may be found, and key-phrase extraction then yields "support vector machine". When TextRank is used to extract an abstract, each sentence is regarded as a node in the graph; if two sentences are similar, a weighted undirected edge is considered to exist between the two corresponding nodes, with the similarity as its weight. The sentences whose importance computed by the PageRank algorithm is highest are taken as the abstract. The similarity of two sentences Si and Sj is calculated with the following formula:
Similarity(Si, Sj) = |{wk | wk ∈ Si & wk ∈ Sj}| / (log|Si| + log|Sj|)
where |{wk | wk ∈ Si & wk ∈ Sj}| is the number of words that occur in both sentences and |Si| is the number of words in sentence i;
Since the graph is weighted, the PageRank formula is modified as follows:
WS(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} [ wji / Σ_{Vk ∈ Out(Vj)} wjk ] × WS(Vj)
When calculating keyword, a word is considered as a sentence, then the weight on the side that all sentences are constituted all is 0, The weight w of molecule denominator reduces, and it is PageRank that TextRank algorithm, which is degenerated,;Inside textrank4zh module Removable three words for extracting original text of TextRank algorithm, this article to be briefly summarized.
It should be noted that the jieba word-segmentation module and Google's Word2vec module are used in word-vector training, and a function for recognizing new words was implemented. Because jieba segmentation handles new words of three or more Chinese characters poorly (for example 小游戏 "mini game", 自动驾驶 "autonomous driving", 物联网 "Internet of Things", 区块链 "blockchain"), such new words are extracted with this function. The basic principle is: after jieba segmentation, if two words a and b are adjacent, for example a = 小 and b = 游戏, they are combined into c = a + b = 小游戏; if c appears more than a certain threshold number of times, c is taken as a new word and added to the jieba dictionary.
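A minimal sketch of this new-word detection idea (the threshold value is an assumed parameter):

    # New-word detection sketch: merge adjacent jieba tokens that co-occur often enough.
    from collections import Counter
    import jieba

    def find_new_words(documents, threshold=50):
        # Count adjacent token pairs; pairs appearing more than `threshold` times
        # are treated as new words and added to the jieba dictionary.
        pair_counts = Counter()
        for doc in documents:
            tokens = list(jieba.cut(doc))
            for a, b in zip(tokens, tokens[1:]):
                if len(a + b) >= 3:        # multi-character candidates, e.g. 小 + 游戏
                    pair_counts[a + b] += 1
        new_words = [w for w, n in pair_counts.items() if n > threshold]
        for w in new_words:
            jieba.add_word(w)              # e.g. 小游戏, 自动驾驶, 区块链
        return new_words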
Preferably, in the above article recommendation method based on Chinese similarity calculation, the specific steps of step 2 include:
Step 2.1: definition of word vectors. A machine-learning task needs every input to be quantified as a numerical representation so that the computing power of the computer can be fully used to calculate the desired result. One representation of a word vector is the one-hot representation:
First, all the vocabulary in the corpus is counted and each word is numbered; a V-dimensional vector is then established for each word, each dimension of the vector represents a word, the value at the dimension corresponding to the word's number is 1, and all other dimensions are 0;
Step 2.2: methods for obtaining word vectors include methods based on singular value decomposition and methods based on iteration. The method based on singular value decomposition:
a. Word-document matrix
A matrix of words and documents is established, and the vector representation of each word is obtained by performing singular value decomposition on this matrix;
b. Word-word matrix
A context window is set and a co-occurrence matrix between words is built from the statistics; the word vectors are obtained by performing singular value decomposition on this matrix;
The method based on iteration: the specific formalization is as follows:
Unigram language model: it is assumed that the probability of each word depends only on the word itself, P(w1, w2, …, wn) = Π P(wi);
Bigram language model: it is assumed that the probability of the current word depends on the previous word, P(w1, w2, …, wn) = Π P(wi | wi-1);
a. Continuous Bag of Words Model (CBOW)
Given the context, the probability distribution of the target word is predicted: an objective function is set first, and the neural network is then optimized by gradient descent;
The objective function is the cross-entropy function:
H(y, ŷ) = - Σj yj log(ŷj)
Since y is a one-hot representation, only the term whose index j equals the index i of the true target word is non-zero, so the objective function becomes:
H(y, ŷ) = - log(ŷi)
Substituting the calculation formula of the predicted value (the softmax output of the network), the objective function can be transformed accordingly.
b. Skip-Gram Model
The skip-gram model predicts the probability of the context given the target word: an objective function is set, and an optimization method is then used to find the optimal parameters; the objective function is the negative log-likelihood of the context words given the target word.
Computing the probabilities between the hidden layer and all V words of the output layer is expensive, so to optimize model training the model is trained with the two methods Hierarchical Softmax and Negative Sampling.
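A training sketch using the gensim (4.x) implementation of Word2vec (the corpus path and hyperparameter values are assumptions):

    # Word2vec training sketch with gensim; corpus path and hyperparameters are assumed values.
    import jieba
    from gensim.models import Word2Vec

    # One article per line in the corpus file; each line is segmented with jieba.
    with open("articles.txt", encoding="utf-8") as f:
        sentences = [list(jieba.cut(line.strip())) for line in f if line.strip()]

    model = Word2Vec(
        sentences,
        vector_size=200,    # dimensionality of the word vectors
        window=5,           # context window size
        min_count=5,        # ignore rare words
        sg=1,               # 1 = skip-gram, 0 = CBOW
        hs=0, negative=10,  # negative sampling; use hs=1, negative=0 for hierarchical softmax
        workers=4,
    )
    model.save("word2vec.model")
    vector = model.wv["互联网"]    # trained vector of a word, assuming it is in the vocabulary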
Preferably, in the above article recommendation method based on Chinese similarity calculation, in step 3 the articles to be recommended are converted into word-vector matrices and the matrix data is standardized; each article is represented by a group of vectors and intermediate file data is generated. Keyword extraction is performed based on the TF-IDF algorithm: term frequency-inverse document frequency is a common weighting technique in information retrieval and text mining; it assesses the importance of a word to a document in a document set or corpus;
Term frequency is the number of times a given word appears in the document: TFw = (number of occurrences of term w in a given class) / (total number of terms in that class);
Inverse document frequency: the inverse document frequency of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:
IDF = log(total number of documents in the corpus / (number of documents containing term w + 1));
TF-IDF=TF*IDF;
The keywords of each article are obtained, these keywords are then vectorized, and finally the vectors are merged to obtain the vector representation of the article.
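One possible sketch of this step, using jieba's built-in TF-IDF keyword extraction together with the word vectors trained above (the keyword count and the weighted-average merging scheme are assumptions):

    # Article vector sketch: extract TF-IDF keywords with jieba, then merge their word vectors.
    import numpy as np
    import jieba.analyse
    from gensim.models import Word2Vec

    model = Word2Vec.load("word2vec.model")

    def article_vector(text, top_k=20):
        # Top-k keywords of the article with their TF-IDF weights.
        keywords = jieba.analyse.extract_tags(text, topK=top_k, withWeight=True)
        vectors, weights = [], []
        for word, weight in keywords:
            if word in model.wv:
                vectors.append(model.wv[word])
                weights.append(weight)
        if not vectors:
            return np.zeros(model.vector_size)
        # Merge the keyword vectors into one article vector (TF-IDF weighted average).
        return np.average(vectors, axis=0, weights=weights)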
Preferably, in the above article recommendation method based on Chinese similarity calculation, in step 4 the user's keyword phrase is converted into a group of matrices, the article matrices obtained in step 3 are read, and the matrix calculation is performed, thereby obtaining a column of data that is ranked by similarity coefficient. The standardization formula is as follows:
n = x / ||x||
that is, n is the standardized vector whose modulus is 1, i.e. a unit vector; the cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n) measures the similarity between a and b;
that is:
cos(θ) = (a · b) / (|a| |b|) = Σk x1k·x2k / ( sqrt(Σk x1k²) · sqrt(Σk x2k²) )
The cos(θ) obtained is the similarity coefficient. Let A be the matrix composed of the article vectors and B the user's keyword vector; the similarity coefficient vector is C = A·B, and the length of C(c1, c2, …, cn) is the number of articles, where n is the total number of articles to be recommended and c1 is the similarity coefficient between the article numbered 1 and the user's keywords. Articles are recommended to the user according to the magnitude of the similarity coefficient between the article vectors and the keyword vector entered by the user; articles with a larger similarity coefficient are recommended first.
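A minimal numpy sketch of the standardization and ranking in step 4 (the article data and keyword-vector handling are placeholders):

    # Cosine-similarity ranking sketch: standardize vectors to unit length, then C = A @ B.
    import numpy as np

    def normalize(m):
        # Divide each vector by its modulus so that every vector becomes a unit vector.
        norm = np.linalg.norm(m, axis=-1, keepdims=True)
        return m / np.where(norm == 0, 1, norm)

    def recommend(article_matrix, keyword_vector, article_ids, top_n=10):
        A = normalize(np.asarray(article_matrix, dtype=float))   # one row per article
        b = normalize(np.asarray(keyword_vector, dtype=float))   # user's keyword-phrase vector
        C = A @ b                                                 # similarity coefficients c1 ... cn
        order = np.argsort(-C)                                    # larger coefficient first
        return [(article_ids[i], float(C[i])) for i in order[:top_n]]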
It can be seen from the above technical solution that, compared with the prior art, the present disclosure provides an article recommendation method based on Chinese similarity calculation that can help Internet users efficiently discover articles of interest, with a wide scope of application, a low manual-labeling cost and good recommendation diversity. First, the main content of articles is crawled with a Python crawler; word vectors are then obtained from the crawled article content and trained; next, the articles to be recommended are converted into word-vector matrices; finally, the user's keyword phrase is converted into a group of matrices, the word-vector matrices of the articles are read, and the matrix calculation is performed, thus obtaining a column of data that is ranked by similarity coefficient, on the basis of which articles are recommended to the user.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flow chart of the invention;
Fig. 2 is a schematic diagram of the CBOW model structure of the invention;
Fig. 3 is a schematic diagram of the skip-gram model structure of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to Figures 1-3. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
The embodiment of the invention discloses an article recommendation method based on Chinese similarity calculation that can help Internet users efficiently discover articles of interest, with a wide scope of application, a low manual-labeling cost and good recommendation diversity.
As shown in Fig. 1, an article recommendation method based on Chinese similarity calculation comprises the following specific steps:
Step 1: crawl the main content of articles with a Python crawler;
Step 2: obtain word vectors from the crawled article content and train them;
Step 3: convert the articles to be recommended into word-vector matrices;
Step 4: convert the user's keyword phrase into a matrix, read the word-vector matrices of the articles obtained in step 3, standardize the word-vector matrix data, perform the matrix calculation, and rank by similarity coefficient.
In order to further optimize the above technical solution, the main content of an article crawled in step 1 specifically includes: the text content, the head image and the article abstract. The text content is used to generate the word-vector representation of the article; the head image is displayed with articles recommended to the user; the article abstract is obtained extractively with the TextRank algorithm as three sentences of the original text, in order to summarize the main content of the article.
In order to further optimize the above technical solution, the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text. The entire World Wide Web can be viewed as a directed graph whose nodes are webpages; if webpage A contains a link to webpage B, there is a directed edge pointing from A to B. After the graph has been constructed, the following formula is used:
S(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance of webpage i; d is the damping coefficient, usually set to 0.85; In(Vi) is the set of webpages that contain a link pointing to webpage i; Out(Vj) is the set of webpages pointed to by the links in webpage j; and |Out(Vj)| is the number of elements in that set. The importance of a webpage depends on the sum of the importance of the webpages linking to it.
In order to further optimize the above technical solution, the TextRank algorithm splits the original text into sentences, filters out the stop words in each sentence, and keeps only words of specified parts of speech, yielding a set of sentences and a set of words. Each word is a node in PageRank. The window size is set to k; assuming a sentence consists of the words w1, w2, w3, …, then [w1, w2, …, wk], [w2, w3, …, wk+1], [w3, w4, …, wk+2] and so on are windows, and an unweighted undirected edge exists between the nodes of any two words that appear in the same window. Based on the graph constructed above, the importance of each word node is calculated, and the most important words are taken as keywords. TextRank is also used to extract key phrases: if several keywords are adjacent in the original text, those keywords form a key phrase. When TextRank is used to extract an abstract, each sentence is regarded as a node in the graph; if two sentences are similar, a weighted undirected edge is considered to exist between the two corresponding nodes, with the similarity as its weight. The sentences whose importance computed by the PageRank algorithm is highest are taken as the abstract. The similarity of two sentences Si and Sj is calculated with the following formula:
Similarity(Si, Sj) = |{wk | wk ∈ Si & wk ∈ Sj}| / (log|Si| + log|Sj|)
where |{wk | wk ∈ Si & wk ∈ Sj}| is the number of words that occur in both sentences and |Si| is the number of words in sentence i;
Since the graph is weighted, the PageRank formula is modified as follows:
WS(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} [ wji / Σ_{Vk ∈ Out(Vj)} wjk ] × WS(Vj)
When calculating keywords, each word is regarded as a "sentence"; the weights of the edges between all these sentences are then 0, the weights w in numerator and denominator drop out, and the TextRank algorithm degenerates to PageRank. The TextRank algorithm inside the textrank4zh module can extract three sentences of the original text to briefly summarize the article.
In order to further optimize the above technical solution, the specific steps of step 2 include:
Step 2.1: definition of word vectors. A machine-learning task needs every input to be quantified as a numerical representation so that the computing power of the computer can be fully used to calculate the desired result. One representation of a word vector is the one-hot representation:
First, all the vocabulary in the corpus is counted and each word is numbered; a V-dimensional vector is then established for each word, each dimension of the vector represents a word, the value at the dimension corresponding to the word's number is 1, and all other dimensions are 0;
Step 2.2: methods for obtaining word vectors include methods based on singular value decomposition and methods based on iteration. The method based on singular value decomposition:
a. Word-document matrix
A matrix of words and documents is established, and the vector representation of each word is obtained by performing singular value decomposition on this matrix;
b. Word-word matrix
A context window is set and a co-occurrence matrix between words is built from the statistics; the word vectors are obtained by performing singular value decomposition on this matrix;
The method based on iteration: the specific formalization is as follows:
Unigram language model: it is assumed that the probability of each word depends only on the word itself, P(w1, w2, …, wn) = Π P(wi);
Bigram language model: it is assumed that the probability of the current word depends on the previous word, P(w1, w2, …, wn) = Π P(wi | wi-1);
a. Continuous Bag of Words Model (CBOW)
As shown in Fig. 2, given the context, the probability distribution of the target word is predicted: an objective function is set first, and the neural network is then optimized by gradient descent;
The objective function is the cross-entropy function:
H(y, ŷ) = - Σj yj log(ŷj)
Since y is a one-hot representation, only the term whose index j equals the index i of the true target word is non-zero, so the objective function becomes:
H(y, ŷ) = - log(ŷi)
Substituting the calculation formula of the predicted value (the softmax output of the network), the objective function can be transformed accordingly.
b. Skip-Gram Model
As shown in Fig. 3, the skip-gram model predicts the probability of the context given the target word: an objective function is set, and an optimization method is then used to find the optimal parameters; the objective function is the negative log-likelihood of the context words given the target word.
Computing the probabilities between the hidden layer and all V words of the output layer is expensive, so to reduce training time and optimize model training, the model is trained with the two methods Hierarchical Softmax and Negative Sampling.
In order to further optimize the above technical solution, in step 3 the articles to be recommended are converted into word-vector matrices and the matrix data is standardized; each article is represented by a group of vectors and intermediate file data is generated. Keyword extraction is performed based on the TF-IDF algorithm:
Term frequency-inverse document frequency is a common weighting technique in information retrieval and text mining; it assesses the importance of a word to a document in a document set or corpus;
Term frequency is the number of times a given word appears in the document: TFw = (number of occurrences of term w in a given class) / (total number of terms in that class). Some common words have little effect on the theme, while some words that appear less frequently can express the theme of the article, so using TF alone is not appropriate. The design of the weight must satisfy: the stronger a word's ability to predict the theme, the larger its weight, and conversely the smaller its weight. Among all the articles counted, if some words appear in only a few articles, such words are very useful for identifying the theme of an article, and the weight of these words should be designed to be larger.
Inverse document frequency: the inverse document frequency of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:
IDF = log(total number of documents in the corpus / (number of documents containing term w + 1)); the reason for adding 1 to the denominator is to avoid a denominator of 0. A word with a high term frequency in a particular document and a low document frequency in the whole document set produces a high-weight TF-IDF.
TF-IDF = TF × IDF;
The keywords of each article are obtained, these keywords are then vectorized, and finally the vectors are merged to obtain the vector representation of the article.
In order to further optimize the above technical solution, in step 4 the user's keyword phrase is converted into a group of matrices, the article matrices obtained in step 3 are read, and the matrix calculation is performed, thereby obtaining a column of data that is ranked by similarity coefficient. The standardization formula is as follows:
n = x / ||x||
that is, n is the standardized vector whose modulus is 1, i.e. a unit vector; the cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n) measures the similarity between a and b;
that is:
cos(θ) = (a · b) / (|a| |b|) = Σk x1k·x2k / ( sqrt(Σk x1k²) · sqrt(Σk x2k²) )
The cos(θ) obtained is the similarity coefficient. Let A be the matrix composed of the article vectors and B the user's keyword vector; the similarity coefficient vector is C = A·B, and the length of C(c1, c2, …, cn) is the number of articles, where n is the total number of articles to be recommended and c1 is the similarity coefficient between the article numbered 1 and the user's keywords. Articles are recommended to the user according to the magnitude of the similarity coefficient between the article vectors and the keyword vector entered by the user; articles with a larger similarity coefficient are recommended first.
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to each other. As for the device disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and the relevant parts can refer to the description of the method.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. An article recommendation method based on Chinese similarity calculation, characterized in that its specific steps include:
Step 1: crawl the main content of articles with a Python crawler;
Step 2: obtain word vectors from the crawled article content and train them;
Step 3: convert the articles to be recommended into word-vector matrices;
Step 4: convert the user's keyword phrase into a matrix, read the word-vector matrices of the articles obtained in step 3, standardize the word-vector matrix data, perform the matrix calculation, and rank by similarity coefficient.
2. The article recommendation method based on Chinese similarity calculation according to claim 1, characterized in that the main content of an article crawled in step 1 specifically includes: the text content, the head image and the article abstract; the text content is used to generate the word-vector representation of the article; the head image is displayed with articles recommended to the user; the article abstract is obtained extractively with the TextRank algorithm as three sentences of the original text, in order to summarize the main content of the article.
3. The article recommendation method based on Chinese similarity calculation according to claim 2, characterized in that the TextRank algorithm is based on PageRank and is used to generate keywords and an abstract for a text; the entire World Wide Web is viewed as a directed graph whose nodes are webpages; if webpage A contains a link to webpage B, there is a directed edge pointing from A to B; after the graph has been constructed, the following formula is used:
S(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|
where S(Vi) is the importance of webpage i; d is the damping coefficient, usually set to 0.85; In(Vi) is the set of webpages that contain a link pointing to webpage i; Out(Vj) is the set of webpages pointed to by the links in webpage j; and |Out(Vj)| is the number of elements in that set; the importance of a webpage depends on the sum of the importance of the webpages linking to it.
4. The article recommendation method based on Chinese similarity calculation according to claim 3, characterized in that the TextRank algorithm splits the original text into sentences, filters out the stop words in each sentence, and keeps only words of specified parts of speech, yielding a set of sentences and a set of words; each word is a node in PageRank; the window size is set to k; assuming a sentence consists of the words w1, w2, w3, …, then [w1, w2, …, wk], [w2, w3, …, wk+1], [w3, w4, …, wk+2] and so on are windows, and an unweighted undirected edge exists between the nodes of any two words that appear in the same window; based on the graph constructed above, the importance of each word node is calculated, and the most important words are taken as keywords; TextRank is used to extract key phrases: if several keywords are adjacent in the original text, those keywords form a key phrase; when TextRank is used to extract an abstract, each sentence is regarded as a node in the graph; if two sentences are similar, a weighted undirected edge is considered to exist between the two corresponding nodes, with the similarity as its weight; the sentences whose importance computed by the PageRank algorithm is highest are taken as the abstract; the similarity of two sentences Si and Sj is calculated with the following formula:
Similarity(Si, Sj) = |{wk | wk ∈ Si & wk ∈ Sj}| / (log|Si| + log|Sj|)
where |{wk | wk ∈ Si & wk ∈ Sj}| is the number of words that occur in both sentences and |Si| is the number of words in sentence i;
Since the graph is weighted, the PageRank formula is modified as follows:
WS(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} [ wji / Σ_{Vk ∈ Out(Vj)} wjk ] × WS(Vj)
When calculating keywords, each word is regarded as a sentence; the weights of the edges between all these sentences are then 0, the weights w in numerator and denominator drop out, and the TextRank algorithm degenerates to PageRank; the TextRank algorithm inside the textrank4zh module extracts three sentences of the original text to briefly summarize the article.
5. The article recommendation method based on Chinese similarity calculation according to claim 1, characterized in that the specific steps of step 2 include:
Step 2.1: definition of word vectors. A machine-learning task needs every input to be quantified as a numerical representation so that the computing power of the computer can be fully used to calculate the desired result; one representation of a word vector is the one-hot representation:
First, all the vocabulary in the corpus is counted and each word is numbered; a V-dimensional vector is then established for each word, each dimension of the vector represents a word, the value at the dimension corresponding to the word's number is 1, and all other dimensions are 0;
Step 2.2: methods for obtaining word vectors include methods based on singular value decomposition and methods based on iteration; the method based on singular value decomposition:
a. Word-document matrix
A matrix of words and documents is established, and the vector representation of each word is obtained by performing singular value decomposition on this matrix;
b. Word-word matrix
A context window is set and a co-occurrence matrix between words is built from the statistics; the word vectors are obtained by performing singular value decomposition on this matrix;
The method based on iteration: the specific formalization is as follows:
Unigram language model: it is assumed that the probability of each word depends only on the word itself;
Bigram language model: it is assumed that the probability of the current word depends on the previous word;
a. Continuous Bag of Words Model
Given the context, the probability distribution of the target word is predicted: an objective function is set first, and the neural network is then optimized by gradient descent;
The objective function is the cross-entropy function:
H(y, ŷ) = - Σj yj log(ŷj)
Since y is a one-hot representation, only the term whose index j equals the index i of the true target word is non-zero, so the objective function becomes:
H(y, ŷ) = - log(ŷi)
Substituting the calculation formula of the predicted value, the objective function can be transformed accordingly;
b. Skip-Gram Model
The skip-gram model predicts the probability of the context given the target word: an objective function is set, and an optimization method is then used to find the optimal parameters;
Computing the probabilities between the hidden layer and all V words of the output layer is expensive, so to optimize model training the model is trained with the two methods Hierarchical Softmax and Negative Sampling.
6. The article recommendation method based on Chinese similarity calculation according to claim 1, characterized in that in step 3 the articles to be recommended are converted into word-vector matrices and the matrix data is standardized; each article is represented by a group of vectors and intermediate file data is generated; keyword extraction is performed based on the TF-IDF algorithm:
Term frequency-inverse document frequency is a common weighting technique in information retrieval and text mining; it assesses the importance of a word to a document in a document set or corpus;
Term frequency is the number of times a given word appears in the document: TFw = (number of occurrences of term w in a given class) / (total number of terms in that class);
Inverse document frequency: the inverse document frequency of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:
IDF = log(total number of documents in the corpus / (number of documents containing term w + 1));
TF-IDF = TF × IDF;
The keywords of each article are obtained, these keywords are then vectorized, and finally the vectors are merged to obtain the vector representation of the article.
7. The article recommendation method based on Chinese similarity calculation according to claim 1, characterized in that in step 4 the user's keyword phrase is converted into a group of matrices, the article matrices obtained in step 3 are read, and the matrix calculation is performed, thereby obtaining a column of data that is ranked by similarity coefficient; the standardization formula is as follows:
n = x / ||x||
that is, n is the standardized vector whose modulus is 1, i.e. a unit vector; the cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n) measures the similarity between a and b;
that is:
cos(θ) = (a · b) / (|a| |b|) = Σk x1k·x2k / ( sqrt(Σk x1k²) · sqrt(Σk x2k²) )
The cos(θ) obtained is the similarity coefficient; let A be the matrix composed of the article vectors and B the user's keyword vector; the similarity coefficient vector is C = A·B, and the length of C(c1, c2, …, cn) is the number of articles, where n is the total number of articles to be recommended and c1 is the similarity coefficient between the article numbered 1 and the user's keywords; articles are recommended to the user according to the magnitude of the similarity coefficient between the article vectors and the keyword vector entered by the user, and articles with a larger similarity coefficient are recommended first.
CN201810701560.8A 2018-06-29 2018-06-29 A kind of article recommended method based on Chinese Similarity measures Pending CN110020189A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810701560.8A CN110020189A (en) 2018-06-29 2018-06-29 A kind of article recommended method based on Chinese Similarity measures

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810701560.8A CN110020189A (en) 2018-06-29 2018-06-29 A kind of article recommended method based on Chinese Similarity measures

Publications (1)

Publication Number Publication Date
CN110020189A true CN110020189A (en) 2019-07-16

Family

ID=67188323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810701560.8A Pending CN110020189A (en) 2018-06-29 2018-06-29 A kind of article recommended method based on Chinese Similarity measures

Country Status (1)

Country Link
CN (1) CN110020189A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008243024A (en) * 2007-03-28 2008-10-09 Kyushu Institute Of Technology Information acquisition device, program therefor and method
CN101980196A (en) * 2010-10-25 2011-02-23 中国农业大学 Article comparison method and device
CN103927358A (en) * 2014-04-15 2014-07-16 清华大学 Text search method and system
CN105183833A (en) * 2015-08-31 2015-12-23 天津大学 User model based microblogging text recommendation method and recommendation apparatus thereof
US20170139899A1 (en) * 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device
CN106227722A (en) * 2016-09-12 2016-12-14 中山大学 A kind of extraction method based on listed company's bulletin summary
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN106815297A (en) * 2016-12-09 2017-06-09 宁波大学 A kind of academic resources recommendation service system and method
CN107133315A (en) * 2017-05-03 2017-09-05 有米科技股份有限公司 A kind of smart media based on semantic analysis recommends method
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107832306A (en) * 2017-11-28 2018-03-23 武汉大学 A kind of similar entities method for digging based on Doc2vec

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
芮伟康: "基于语义的文本向量表示方法研究", 《中国优秀硕士论文全文数据库_信息科技辑》 *
芮伟康: "基于语义的文本向量表示方法研究", 《中国优秀硕士论文全文数据库_信息科技辑》, 15 January 2018 (2018-01-15), pages 7 - 35 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597949A (en) * 2019-08-01 2019-12-20 湖北工业大学 Court similar case recommendation model based on word vectors and word frequency
CN110597981A (en) * 2019-09-16 2019-12-20 西华大学 Network news summary system for automatically generating summary by adopting multiple strategies
CN110633363A (en) * 2019-09-18 2019-12-31 桂林电子科技大学 Text entity recommendation method based on NLP and fuzzy multi-criterion decision
CN110851570B (en) * 2019-11-14 2023-04-18 中山大学 Unsupervised keyword extraction method based on Embedding technology
CN110851570A (en) * 2019-11-14 2020-02-28 中山大学 Unsupervised keyword extraction method based on Embedding technology
CN111178059A (en) * 2019-12-07 2020-05-19 武汉光谷信息技术股份有限公司 Similarity comparison method and device based on word2vec technology
CN111178059B (en) * 2019-12-07 2023-08-25 武汉光谷信息技术股份有限公司 Similarity comparison method and device based on word2vec technology
CN112948568B (en) * 2019-12-10 2022-08-30 武汉渔见晚科技有限责任公司 Content recommendation method and device based on text concept network
CN112948568A (en) * 2019-12-10 2021-06-11 武汉渔见晚科技有限责任公司 Content recommendation method and device based on text concept network
CN111061957A (en) * 2019-12-26 2020-04-24 广东电网有限责任公司 Article similarity recommendation method and device
TWI727624B (en) * 2020-01-21 2021-05-11 兆豐國際商業銀行股份有限公司 News filtering device and news filtering method
CN113742602A (en) * 2020-05-29 2021-12-03 中国电信股份有限公司 Method, apparatus, and computer-readable storage medium for sample optimization
CN113761323A (en) * 2020-06-01 2021-12-07 深圳华大基因科技有限公司 Document recommendation system and document recommendation method
CN111651588B (en) * 2020-06-10 2024-03-05 扬州大学 Article abstract information extraction algorithm based on directed graph
CN111651588A (en) * 2020-06-10 2020-09-11 扬州大学 Article abstract information extraction algorithm based on directed graph
CN111753151B (en) * 2020-06-24 2023-09-15 广东科杰通信息科技有限公司 Service recommendation method based on Internet user behavior
CN111753151A (en) * 2020-06-24 2020-10-09 广东科杰通信息科技有限公司 Service recommendation method based on internet user behaviors
CN112000867A (en) * 2020-08-17 2020-11-27 桂林电子科技大学 Text classification method based on social media platform
CN114254851A (en) * 2020-09-24 2022-03-29 Ncr公司 Commodity similarity handling
TWI749901B (en) * 2020-11-25 2021-12-11 重量科技股份有限公司 Method for forming key information and computer system for the same
CN112949287A (en) * 2021-01-13 2021-06-11 平安科技(深圳)有限公司 Hot word mining method, system, computer device and storage medium
CN112949287B (en) * 2021-01-13 2023-06-27 平安科技(深圳)有限公司 Hot word mining method, system, computer equipment and storage medium
CN112686026A (en) * 2021-03-17 2021-04-20 平安科技(深圳)有限公司 Keyword extraction method, device, equipment and medium based on information entropy
CN113554053A (en) * 2021-05-20 2021-10-26 重庆康洲大数据有限公司 Method for comparing similarity of traditional Chinese medicine prescriptions
CN114943224A (en) * 2022-05-07 2022-08-26 新智道枢(上海)科技有限公司 Word vector-based alert text keyword extraction method, system, medium, and device
CN117610543A (en) * 2023-11-08 2024-02-27 华南理工大学 Chinese character and structure association analysis method, medium and equipment based on graph network
CN117610543B (en) * 2023-11-08 2024-08-02 华南理工大学 Chinese character and structure association analysis method, medium and equipment based on graph network

Similar Documents

Publication Publication Date Title
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
Wang et al. Multilayer dense attention model for image caption
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
Thakkar et al. Graph-based algorithms for text summarization
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
CN109960786A (en) Chinese Measurement of word similarity based on convergence strategy
CN103207860B (en) The entity relation extraction method and apparatus of public sentiment event
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN107992542A (en) A kind of similar article based on topic model recommends method
CN107247780A (en) A kind of patent document method for measuring similarity of knowledge based body
CN110674252A (en) High-precision semantic search system for judicial domain
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN102622338A (en) Computer-assisted computing method of semantic distance between short texts
CN106484797A (en) Accident summary abstracting method based on sparse study
CN103473280A (en) Method and device for mining comparable network language materials
CN112256861B (en) Rumor detection method based on search engine return result and electronic device
CN114997288B (en) Design resource association method
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
Sadr et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms
CN109635107A (en) The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source
CN114912449B (en) Technical feature keyword extraction method and system based on code description text
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN111984782A (en) Method and system for generating text abstract of Tibetan language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190716