
CN113392305A - Keyword extraction method and device, electronic equipment and computer storage medium - Google Patents

Info

Publication number
CN113392305A
Authority
CN
China
Prior art keywords
vocabulary
search text
predicate
vocabularies
subject
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011342868.1A
Other languages
Chinese (zh)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011342868.1A
Publication of CN113392305A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a keyword extraction method and device, electronic equipment and a computer storage medium. The method comprises the following steps: acquiring a search text; if the search text is of a question type, performing part-of-speech analysis on each vocabulary of the search text to obtain the part of speech of each vocabulary; performing syntactic analysis on each vocabulary by using a dependency syntax algorithm to obtain the syntactic relation between every two vocabularies that are syntactically related; selecting the vocabularies corresponding to the subject, the predicate and the object from the vocabularies of the search text based on those syntactic relations; and generating a keyword set from the vocabularies corresponding to the subject, the predicate and the object. The keyword set includes the sentence composed of the vocabularies corresponding to the subject, the predicate and the object, as well as each of those vocabularies individually. In this way, keywords that meet the user's needs are extracted from a question sentence.

Description

Keyword extraction method and device, electronic equipment and computer storage medium
Technical Field
The present application relates to the field of information search technologies, and in particular, to a keyword extraction method and apparatus, an electronic device, and a computer storage medium.
Background
In the field of artificial intelligence, search needs have become ubiquitous with the development of internet search engines. Because search needs also arise in chat conversations, many social applications have gradually begun to support fingertip search on sessions, that is, a user can select a session in a chat and search with that session as the query string.
To give the user more initiative in searching and provide a better experience, fingertip search, like the earlier application scenario in which the user actively inputs words or sentences to search, can present several candidate keywords for the user to choose from while the current search is carried out, so that a selected candidate keyword is used as the search keyword for a further search. As in the earlier application scenario, fingertip search selects, based on a pre-built entity dictionary, the words of the user-selected session that appear in the entity dictionary as candidate keywords for the user to choose from.
However, this way of extracting candidate keywords can only extract words that are in the entity dictionary and cannot extract words outside it. In addition, the words in an entity dictionary are usually nouns. In the earlier application scenario, the user typically inputs a noun or a declarative sentence to search, so the extracted nouns can meet the user's search needs. Chat sessions, however, often contain question sentences, and extracting only nouns cannot meet the user's search needs. For example, as shown in fig. 1, the user selects the session "whether application A needs to add a tablet function"; through the entity dictionary, only "application A" and "tablet" are extracted as candidate keywords, and the information obtained by subsequently searching for "application A" or "tablet" clearly does not match the user's original search need. It can be seen that the existing keyword extraction approach cannot properly extract, from question sentences in fingertip search, keywords that meet the user's subsequent search needs.
Disclosure of Invention
Based on the defects of the prior art, the application provides a keyword extraction method and device, electronic equipment and a computer storage medium, so as to solve the problem that the keyword meeting the user requirement cannot be effectively extracted in the prior art.
In order to achieve the above object, the present application provides the following technical solutions:
the first aspect of the present application provides a keyword extraction method, which is characterized by including:
acquiring a search text;
judging whether the search text is of a question type;
if the search text is judged to be the question type, performing part-of-speech analysis on each vocabulary of the search text to obtain the part-of-speech of each vocabulary;
performing syntactic analysis on each vocabulary by using a dependency syntactic algorithm to obtain a syntactic relation between every two vocabularies with syntactic relations;
selecting vocabularies corresponding to the subject, the predicate and the object from all the vocabularies of the search text based on the syntactic relation between every two vocabularies with syntactic relation;
generating a keyword set by using the vocabularies corresponding to the subject, the predicate and the object respectively; wherein the keyword set comprises: the sentence composed of the vocabularies corresponding to the subject, the predicate and the object, and each of the vocabularies corresponding to the subject, the predicate and the object.
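The selection and generation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the dependency arcs are supplied by hand rather than produced by a parser, the labels "SBV" (subject-verb) and "VOB" (verb-object) are assumed label names, and `Arc`, `select_spo`, `build_keyword_set` and the `joiner` parameter are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Arc:
    head: int        # index of the governing vocabulary
    dependent: int   # index of the dependent vocabulary
    relation: str    # dependency label (label names are an assumption)


def select_spo(words: Sequence[str],
               arcs: Sequence[Arc]) -> Tuple[List[str], List[str], List[str]]:
    """Pick subject, predicate and object vocabularies from dependency arcs."""
    subjects, predicates, objects = set(), set(), set()
    for arc in arcs:
        if arc.relation == "SBV":      # dependent is the subject, head the predicate
            subjects.add(arc.dependent)
            predicates.add(arc.head)
        elif arc.relation == "VOB":    # dependent is the object, head the predicate
            objects.add(arc.dependent)
            predicates.add(arc.head)

    def pick(indices):
        # Keep the order in which the vocabularies appear in the search text.
        return [words[i] for i in sorted(indices)]

    return pick(subjects), pick(predicates), pick(objects)


def build_keyword_set(subjects, predicates, objects, joiner=" "):
    """Return the composed sentence followed by each individual vocabulary."""
    parts = list(subjects) + list(predicates) + list(objects)
    keywords = [joiner.join(parts)]
    for word in parts:
        if word not in keywords:       # avoid duplicate entries
            keywords.append(word)
    return keywords


# Hand-written parse of an already segmented question (illustrative only).
words = ["application A", "add", "tablet function"]
arcs = [Arc(head=1, dependent=0, relation="SBV"),
        Arc(head=1, dependent=2, relation="VOB")]
keyword_set = build_keyword_set(*select_spo(words, arcs))
```

For Chinese text the `joiner` would presumably be the empty string; the space is used here only because the example words are in English.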
Optionally, in the method for extracting a keyword, the determining whether the search text is a question type includes:
performing word segmentation on the search text to obtain each word of the search text;
respectively carrying out feature processing on each vocabulary to obtain a feature vector of each vocabulary;
and calling a pre-trained convolutional neural network model to process the feature vector of each vocabulary, and determining whether the search text is of a question type.
Optionally, in the above method for extracting a keyword, the method for training a convolutional neural network model includes:
acquiring a plurality of question titles and a plurality of news titles;
taking each question title and each news title as training sample data; wherein each question title is used as positive training sample data, and each news title is used as negative training sample data;
segmenting each training sample data to obtain a sample vocabulary corresponding to each training sample data;
respectively carrying out feature processing on each sample vocabulary to obtain a feature vector of each sample vocabulary;
inputting the feature vectors of the sample vocabularies corresponding to the training sample data into a convolutional neural network model, and calculating through the convolutional neural network model to obtain a classification result of the training sample data;
if the error between the classification result of the training sample data and the label of the training sample data is larger than a preset threshold value, performing parameter adjustment on the convolutional neural network model, and returning to execute the step of inputting the feature vector of each sample vocabulary corresponding to the training sample data into the convolutional neural network model; wherein the label of the positive training sample data is 1, and the label of the negative training sample data is 0;
and if the error between the classification result of the training sample data and the label of the training sample data is not greater than a preset threshold value, determining that the training of the convolutional neural network model is finished.
Optionally, in the method for extracting a keyword, after selecting vocabularies corresponding to the subject, the predicate, and the object from the vocabularies of the search text, the method further includes:
if vocabularies meeting the merging criterion exist among the vocabularies corresponding to the selected subject, merging the vocabularies meeting the merging criterion into one vocabulary; wherein the merging criterion is: there are multiple vocabularies corresponding to the subject, and their positions in the search text are contiguous.
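A minimal sketch of this merging step, assuming each subject vocabulary is available together with its position in the segmented search text. The function name `merge_contiguous` and the `joiner` parameter are hypothetical and introduced only for illustration.

```python
from typing import List, Sequence, Tuple


def merge_contiguous(indexed_words: Sequence[Tuple[int, str]],
                     joiner: str = "") -> List[str]:
    """Merge subject vocabularies whose positions in the text are consecutive."""
    merged: List[str] = []
    run: List[Tuple[int, str]] = []
    for pos, word in sorted(indexed_words):
        if run and pos == run[-1][0] + 1:
            run.append((pos, word))            # extends the current contiguous run
        else:
            if run:
                merged.append(joiner.join(w for _, w in run))
            run = [(pos, word)]                # start a new run
    if run:
        merged.append(joiner.join(w for _, w in run))
    return merged


# "Tencent" (position 1) and "Meeting" (position 2) are adjacent subject words.
subjects = merge_contiguous([(1, "Tencent"), (2, "Meeting")], joiner=" ")
```

Subject vocabularies that are not adjacent in the search text are left as separate entries.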
Optionally, in the method for extracting a keyword, after selecting vocabularies corresponding to the subject, the predicate, and the object from the vocabularies of the search text, the method further includes:
comparing the vocabulary corresponding to the predicate with preset target predicate words;
and removing the vocabulary corresponding to the predicate matched with the target predicate.
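The predicate filtering can be sketched as a simple membership test. The contents of the preset target predicate list are not given in the text, so the entries used below are purely hypothetical placeholders, as is the function name `remove_target_predicates`.

```python
from typing import Iterable, List


def remove_target_predicates(predicates: Iterable[str],
                             target_predicates: Iterable[str]) -> List[str]:
    """Drop every predicate vocabulary that matches a preset target predicate."""
    targets = set(target_predicates)
    return [p for p in predicates if p not in targets]


# Hypothetical preset list; the real list is not specified in the text.
TARGET_PREDICATES = ["is", "have"]
kept = remove_target_predicates(["is", "add"], TARGET_PREDICATES)
```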
Optionally, in the method for extracting a keyword, after generating a keyword set by using vocabularies corresponding to the subject, the predicate, and the object, the method further includes:
calling a historical search record of a user;
determining the historical search times of each vocabulary in the keyword set by using the historical search records;
and removing the vocabulary with the historical search times smaller than the preset times from the keyword set.
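A minimal sketch of the history-based filtering. How the historical search count of a vocabulary is computed is not specified, so this example assumes it is the number of past queries that contain the vocabulary as a substring; `filter_by_history` and `min_count` are hypothetical names.

```python
from typing import List, Sequence


def filter_by_history(keywords: Sequence[str],
                      history: Sequence[str],
                      min_count: int) -> List[str]:
    """Keep only keywords found in at least `min_count` past queries."""
    kept = []
    for keyword in keywords:
        # Assumed counting rule: one hit per past query containing the keyword.
        count = sum(1 for query in history if keyword in query)
        if count >= min_count:
            kept.append(keyword)
    return kept


history = ["tablet price", "best tablet", "add friend"]
frequent = filter_by_history(["tablet", "add"], history, min_count=2)
```

Keywords whose count is below the preset number are removed, which matches the rule stated above.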
The second aspect of the present application provides an apparatus for extracting a keyword, including:
a first acquisition unit configured to acquire a search text;
the judging unit is used for judging whether the search text is of a question type or not;
the part of speech analysis unit is used for analyzing the part of speech of each vocabulary of the search text to obtain the part of speech of each vocabulary when the search text is of a question type;
the syntax analysis unit is used for carrying out syntax analysis on each vocabulary by utilizing a dependency syntax algorithm to obtain the syntax relation between every two vocabularies with the syntax relation;
the extraction unit is used for selecting vocabularies corresponding to the subject, the predicate and the object from all the vocabularies of the search text based on the syntactic relation between every two vocabularies with syntactic relation;
the generating unit is used for generating a keyword set by utilizing the vocabularies corresponding to the subject, the predicate and the object respectively; wherein the keyword set comprises: the sentence composed of the vocabularies corresponding to the subject, the predicate and the object, and each of the vocabularies corresponding to the subject, the predicate and the object.
Optionally, in the above apparatus for extracting a keyword, the determining unit includes:
the first word segmentation unit is used for performing word segmentation on the search text to obtain each word of the search text;
the first characteristic processing unit is used for respectively carrying out characteristic processing on each vocabulary to obtain a characteristic vector of each vocabulary;
and the classification unit is used for calling a pre-trained convolutional neural network model to process the feature vector of each vocabulary and determining whether the search text is of a question type.
Optionally, in the above apparatus for extracting a keyword, the apparatus further includes a model training unit, where the model training unit includes:
a second obtaining unit configured to obtain a plurality of question titles and a plurality of news titles;
a sample unit, configured to use each of the question titles and each of the news titles as training sample data; wherein each question title is used as positive training sample data, and each news title is used as negative training sample data;
the second word segmentation unit is used for segmenting words of each training sample data to obtain a sample vocabulary corresponding to each training sample data;
the second characteristic processing unit is used for respectively carrying out characteristic processing on each sample vocabulary to obtain a characteristic vector of each sample vocabulary;
the input unit is used for inputting the feature vectors of the sample vocabularies corresponding to the training sample data into a convolutional neural network model, and calculating through the convolutional neural network model to obtain the classification result of the training sample data;
a parameter adjusting unit, configured to, when an error between the classification result of the training sample data and the label of the training sample data is greater than a preset threshold, perform parameter adjustment on the convolutional neural network model, and return to perform the input of the feature vector of each sample vocabulary corresponding to the training sample data into the convolutional neural network model; wherein the label of the positive training sample data is 1, and the label of the negative training sample data is 0;
a first determining unit, configured to determine that training of the convolutional neural network model is completed when an error between the classification result of the training sample data and the label of the training sample data is not greater than a preset threshold.
Optionally, in the above apparatus for extracting a keyword, the apparatus further includes:
the merging unit is used for merging the vocabularies meeting the merging criterion into one vocabulary when vocabularies meeting the merging criterion exist among the vocabularies corresponding to the selected subject; wherein the merging criterion is: there are multiple vocabularies corresponding to the subject, and their positions in the search text are contiguous.
Optionally, in the above apparatus for extracting a keyword, the apparatus further includes:
the comparison unit is used for comparing the vocabulary corresponding to the predicates with preset target predicate words;
and the first removing unit is used for removing the vocabulary corresponding to the predicate matched with the target predicate.
Optionally, in the above apparatus for extracting a keyword, the apparatus further includes:
the retrieval unit is used for retrieving the historical search records of the user;
a second determining unit, configured to determine, by using the historical search record, a historical search frequency of each vocabulary in the keyword set;
and the second removing unit is used for removing the vocabulary with the historical search frequency less than the preset frequency from the keyword set.
A third aspect of the present application provides a computer storage medium for storing a computer program, which when executed, is configured to implement the keyword extraction method as described in any one of the above.
A fourth aspect of the present application provides an electronic device, comprising:
a memory and a processor;
wherein the memory is used for storing programs;
the processor is configured to execute the program, and when the program is executed, the processor is specifically configured to implement the keyword extraction method according to any one of the above items.
According to the keyword extraction method provided by the present application, a search text is acquired and, when the search text is determined to be of the question type, part-of-speech analysis is performed on each vocabulary of the search text to obtain its part of speech. Based on the part of speech of each vocabulary, syntactic analysis is then performed using a dependency syntax algorithm to obtain the syntactic relation between every two vocabularies that are syntactically related. Finally, based on these syntactic relations, the vocabularies corresponding to the subject, the predicate and the object are selected from the vocabularies of the search text, and a keyword set consisting of the selected vocabularies and the sentence composed of them is generated. Keyword extraction for question sentences is thus based on a dependency syntax algorithm rather than on an entity dictionary, which makes the extraction more flexible and better suited to a variety of chat conversations, so that keywords meeting the user's needs can be accurately extracted from question sentences.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic view of an operation interface for fingertip search;
FIG. 2 is a flowchart of a keyword extraction method provided in an embodiment of the present application;
FIG. 3 is a flowchart of a method for determining whether a search text is a question type according to another embodiment of the present application;
FIG. 4 is a flowchart of a method for training a convolutional neural network model according to another embodiment of the present application;
FIG. 5 is a schematic structural diagram of a convolutional neural network model according to another embodiment of the present application;
FIG. 6 is a diagram illustrating an example of lexical analysis of words provided in accordance with another embodiment of the present application;
FIG. 7 is a diagram of an example of syntactic analysis provided in another embodiment of the present application;
FIG. 8 is a schematic structural diagram of an apparatus for extracting keywords according to another embodiment of the present application;
FIG. 9 is a schematic structural diagram of a model training unit according to another embodiment of the present application;
FIG. 10 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In this application, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The application provides a keyword extraction method which is mainly applied to the field of artificial intelligence. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
More specifically, the method for extracting keywords provided by the present application belongs to a Natural Language Processing (NLP) direction in the field of artificial intelligence. Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.
The embodiment of the application provides a keyword extraction method, as shown in fig. 2, specifically including the following steps:
S201, obtaining a search text.
First, the embodiments of the present application mainly use a fingertip search as an example for description, but the method provided by the present application is not limited to be used in the fingertip search.
The search text refers to the text content that the user inputs or selects for searching. For fingertip search, the user triggers a specified action on a chat session. For example, as shown in fig. 1, the user may select the text of the current chat session, "A application needs to add a tablet functionality", by long-pressing it. The system then pops up virtual keys such as "Search", "Copy" and "Cut". After the user taps the "Search" key, as shown in the right-hand diagram of fig. 1, the system pops up a search interface, performs an information search using the selected chat session as the search text, and feeds the search results back to the user. The search text is therefore obtained when the user selects a session to search, and while the system performs the current search, keywords are extracted from the obtained search text so that the user can choose among them for a subsequent search.
S202, judging whether the search text is of a question type.
It should be noted that the keyword extraction method provided in the embodiments of the present application targets search text of the question type, that is, it is directed at question sentences.
Therefore, if the search text is determined to be of the question type, step S203 is executed.
Optionally, referring to fig. 3, a method for determining whether a search text is a question type after the search text is obtained is shown, and specifically includes the following steps:
S301, performing word segmentation on the search text to obtain each vocabulary of the search text.
It should be noted that each vocabulary obtained by word segmentation is a text unit, which may be a word composed of two or more characters or only a single character.
Optionally, the search text may be segmented using an existing word segmentation tool, such as a Chinese word segmentation tool, and meaningless vocabularies such as stop words, for example "Didi, Do", may be removed.
S302, respectively carrying out feature processing on each vocabulary to obtain a feature vector of each vocabulary.
Alternatively, the vocabulary may also be characterized using existing tools, such as the word2vec model.
Specifically, a plurality of search texts with high occurrence frequency may be collected in advance, and the search texts are respectively subjected to word segmentation to obtain a plurality of words, and the set of the words is called a bag of words.
Then, the bag of words is used for training, and a word vector model (i.e. word2vec model) is obtained. Based on the word vector model obtained by training, when step S302 is executed, the word vector of each vocabulary obtained by word segmentation in step S301 can be found from the word vector model.
For example, the search text may be "May I ask whether adding a handwriting pad function is being considered in Tencent Meeting?" Segmenting this search text yields: may I ask, Tencent, meeting, in, whether, consider, add, handwriting pad, function. The word vector of each vocabulary is then looked up in the word vector model.
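The lookup in step S302 can be sketched with a plain dictionary standing in for the trained word vector model. The vectors below are made-up numbers, not output of a real word2vec model, and returning a zero vector for a vocabulary missing from the model is an assumption, since the text does not say how such vocabularies are handled. `lookup_vectors` is a hypothetical helper name.

```python
from typing import Dict, List, Sequence


def lookup_vectors(words: Sequence[str],
                   word_vectors: Dict[str, List[float]],
                   dim: int) -> List[List[float]]:
    """Return one feature vector per vocabulary, in text order."""
    zero = [0.0] * dim
    # Assumed fallback: vocabularies absent from the model map to a zero vector.
    return [list(word_vectors.get(word, zero)) for word in words]


# Toy stand-in for a trained word vector model (values are illustrative only).
toy_model = {"add": [0.1, 0.2], "function": [0.3, -0.1]}
features = lookup_vectors(["add", "handwriting pad", "function"], toy_model, dim=2)
```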
S303, calling a pre-trained convolutional neural network model to process the feature vector of each vocabulary, and determining whether the search text is of a question type.
Specifically, the feature vectors of each vocabulary obtained by segmenting the search text are input into a pre-trained convolutional neural network model as a whole, and the input feature vectors of each vocabulary are processed through the convolutional neural network model to obtain the probability that the search text belongs to the question type. And if the probability that the search text output by the convolutional neural network model belongs to the question type is greater than the preset probability value, determining the search text to be the question type.
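Steps S301 to S303 can be sketched as one function that chains segmentation, feature lookup and classification, then compares the output probability with a preset value. The three callables are injected so that the sketch stays independent of any particular tool; the toy stand-ins below are hypothetical and only make the example runnable, and the default threshold of 0.5 is an assumption.

```python
from typing import Callable, List, Sequence


def is_question(text: str,
                segment: Callable[[str], Sequence[str]],
                embed: Callable[[str], List[float]],
                predict_probability: Callable[[List[List[float]]], float],
                threshold: float = 0.5) -> bool:
    """Return True when the classifier's question probability exceeds the threshold."""
    words = segment(text)                         # S301: word segmentation
    vectors = [embed(word) for word in words]     # S302: feature vectors
    probability = predict_probability(vectors)    # S303: classifier output
    return probability > threshold


# Toy stand-ins (assumptions, not real models): whitespace segmentation, a
# one-dimensional "question cue" feature, and a classifier that takes the maximum.
def toy_segment(text):
    return text.split()


def toy_embed(word):
    return [1.0 if word.lower() in {"whether", "how", "why"} else 0.0]


def toy_predict(vectors):
    return max(vector[0] for vector in vectors)
```

In practice `predict_probability` would wrap the pre-trained convolutional neural network model, or any other classifier with a probability output.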
Optionally, the embodiment of the present application adopts a convolutional neural network model as the classifier for determining whether the search text is of a question type. Of course, other classifiers that can determine whether the search text is of a question type may also be used.
Optionally, in another embodiment of the present application, a training method of the convolutional neural network model in step S303 is provided, as shown in fig. 4, specifically including the following steps:
S401, obtaining a plurality of question titles and a plurality of news titles.
It should be noted that, because chat conversations involve user privacy, it is not appropriate to obtain users' chat conversations and screen training samples out of them. Therefore, in the embodiment of the present application, question titles and news titles are used as training samples. Of course, if authorized by the user, the user's chat conversations may also be used as training samples for model training.
A question title mainly refers to the title of a question posed by a user on a question-and-answer website or client, for example, a question title on Baidu Zhidao. Question titles are of the question type, that is, they are questions asked by users, and such data reflects characteristics such as the types and phrasing of most user questions, which benefits the training of the model. A news title is usually a declarative sentence, so it can be used as a negative training sample.
S402, taking each question title and each news title as training sample data.
Each question title is used as positive training sample data, and each news title is used as negative training sample data. Meanwhile, labels are also required to be marked on each training sample data, wherein the label of positive training sample data is 1, and the label of negative training sample data is 0.
S403, segmenting each training sample data to obtain a sample vocabulary corresponding to each training sample data.
Similarly, the existing word segmentation tool can be used for performing word segmentation on the training sample, and performing subsequent processing such as removing stop words, so as to obtain a sample vocabulary corresponding to each training sample data.
And S404, respectively carrying out feature processing on each sample word to obtain a feature vector of each sample word.
Similarly, the feature of each sample vocabulary can be processed by using the existing feature processing model, so that the feature vector of each castrated version vocabulary is obtained.
S405, inputting the feature vectors of the sample vocabularies corresponding to the training sample data into a convolutional neural network model, and calculating through the convolutional neural network model to obtain the classification result of the training sample data.
Specifically, as shown in fig. 5, the convolutional neural network model mainly includes an input layer, a convolutional layer, a pooling layer, and a fully connected layer. The feature vectors of all sample vocabularies of the training sample data are input into the convolutional neural network model through the input layer, and the classification result of the training sample data is output after processing by the convolutional layer, the pooling layer and the fully connected layer in sequence.
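The layer sequence described above can be sketched in plain Python as follows. This is only a sketch under stated assumptions: the ReLU activation, the max-over-time pooling, the sigmoid output and all function names are assumptions made for illustration, since the present application does not specify them, and the input must contain at least as many words as the kernel is wide.

```python
import math
from typing import List

Vector = List[float]

def conv1d(word_vectors: List[Vector], kernel: List[Vector], bias: float) -> List[float]:
    """Convolutional layer: slide a kernel spanning len(kernel) consecutive words (ReLU assumed)."""
    width = len(kernel)
    outputs = []
    for start in range(len(word_vectors) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(w * x for w, x in zip(kernel[offset], word_vectors[start + offset]))
        outputs.append(max(0.0, total))
    return outputs

def max_pool(feature_map: List[float]) -> float:
    """Pooling layer: keep the strongest response of one kernel."""
    return max(feature_map)

def fully_connected(pooled: List[float], weights: List[float], bias: float) -> float:
    """Fully connected layer: map pooled features to a score in (0, 1) (sigmoid assumed)."""
    z = bias + sum(w * p for w, p in zip(weights, pooled))
    return 1.0 / (1.0 + math.exp(-z))

def classify(word_vectors, kernels, kernel_biases, fc_weights, fc_bias) -> float:
    """Input layer -> convolutional layer -> pooling layer -> fully connected layer."""
    pooled = [max_pool(conv1d(word_vectors, k, b)) for k, b in zip(kernels, kernel_biases)]
    return fully_connected(pooled, fc_weights, fc_bias)

# Toy values chosen by hand: three words with 2-dimensional feature vectors, one kernel of width 2.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_kernel = [[1.0, 0.0], [0.0, 1.0]]
score = classify(toy_vectors, [toy_kernel], [0.0], [1.0], -2.0)
```

A score close to 1 would correspond to the question type and a score close to 0 to the non-question type, matching the labels 1 and 0 used for the training samples.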
S406, judging whether the error between the classification result of the training sample data and the label of the training sample data is larger than a preset threshold value.
If the error between the classification result of the training sample data and the label of the training sample data is greater than the preset threshold, step S407 is executed. If the error between the classification result of the training sample data and the label of the training sample data is not greater than the preset threshold, step S408 is executed.
S407, performing parameter adjustment on the convolutional neural network model.
Specifically, the parameters may be adjusted based on a gradient descent method: a loss function over the parameters of the convolutional neural network model is determined, and the partial derivative of the loss function with respect to each parameter is calculated to obtain the corresponding gradient vector. Each parameter is then adjusted by a preset step length in the direction opposite to its gradient vector. After the parameters are adjusted, the procedure returns to step S405 and training of the model continues.
S408, determining that the training of the convolutional neural network model is completed.
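The loop formed by steps S405 to S408 can be sketched as follows. To keep the example short, a single-layer logistic classifier stands in for the convolutional neural network model; this substitution, the mean cross-entropy error, the `max_rounds` safety cap and all names are assumptions made for illustration only.

```python
import math
from typing import List, Tuple

def predict(weights: List[float], bias: float, features: List[float]) -> float:
    """Stand-in classifier: a logistic unit instead of the full network."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(samples: List[Tuple[List[float], int]], step: float = 0.5,
          threshold: float = 0.1, max_rounds: int = 10000):
    dim = len(samples[0][0])
    weights, bias, error = [0.0] * dim, 0.0, float("inf")
    for _ in range(max_rounds):
        # S405: classify every training sample with the current parameters.
        outputs = [predict(weights, bias, x) for x, _ in samples]
        # S406: compare the error between results and labels with the threshold.
        error = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                     for p, (_, y) in zip(outputs, samples)) / len(samples)
        if error <= threshold:
            break  # S408: training is completed.
        # S407: partial derivatives of the loss, then a step against the gradient.
        grad_w = [sum((p - y) * x[i] for p, (x, y) in zip(outputs, samples)) / len(samples)
                  for i in range(dim)]
        grad_b = sum(p - y for p, (_, y) in zip(outputs, samples)) / len(samples)
        weights = [w - step * g for w, g in zip(weights, grad_w)]
        bias -= step * grad_b
    return weights, bias, error

# Toy data: one positive sample (label 1) and one negative sample (label 0).
weights, bias, error = train([([1.0, 0.0], 1), ([0.0, 1.0], 0)])
```

The preset step length corresponds to `step`, and the preset threshold to `threshold`; both values here are arbitrary.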
S203, performing part-of-speech analysis on each vocabulary of the search text to obtain the part-of-speech of each vocabulary.
Optionally, the part of speech of each vocabulary may be tagged using a publicly available part-of-speech tagging tool, poslag. For example, as shown in fig. 6, for the search text "May I ask whether adding a handwriting pad function to Tencent Meeting is being considered", the vocabularies obtained after word segmentation are "ask, Tencent, meeting, in, whether, consider, add, handwriting pad, function"; the part of speech of each vocabulary is then analyzed to obtain the analysis result shown in fig. 6. Here, n refers to nouns, v refers to verbs, and nz and nd refer to proper nouns and other nouns, respectively.
S204, carrying out syntactic analysis on each vocabulary by using a dependency syntactic algorithm to obtain the syntactic relation between every two vocabularies with syntactic relation.
The basic task of syntactic analysis is to determine the syntactic structure of a sentence or the dependency between the words in the sentence. It mainly involves two aspects. One is determining the grammar system of the language, that is, giving a formal definition of the grammatical structure of legal sentences in the language. The other is the syntactic analysis technique, that is, automatically deriving the syntactic structure of a sentence according to the given grammar system, and analyzing the syntactic units contained in the sentence and the relations between these units.
Some common syntax relationships are shown in table 1 below.
TABLE 1
[Table 1 is reproduced as images in the original publication: Figure BDA0002799029820000111 and Figure BDA0002799029820000121]
Therefore, based on the part of speech of each vocabulary, each vocabulary is subjected to syntactic analysis by using a dependency syntactic algorithm, and the syntactic relation between every two vocabularies with syntactic relation is obtained. The specific process of analysis with the dependency syntactic algorithm is the same as in the prior art and is not described here again. For example, performing syntactic analysis on the vocabularies "ask, Tencent, meeting, in, whether, consider, add, handwriting pad, function" obtained after word segmentation yields the syntactic relations shown in fig. 7.
S205, selecting vocabularies corresponding to the subject, the predicate and the object from the vocabularies of the search text based on the syntactic relation between every two vocabularies with the syntactic relation.
After the syntactic relation between every two vocabularies with syntactic relation is obtained, the syntactic component that each vocabulary serves as in the search text can be determined from the syntactic relations among the vocabularies and the part of speech of each vocabulary. Since the subject, the predicate and the object are usually the backbone of a sentence, that is, its key components, and the meaning expressed by a sentence is usually carried by them, in the embodiment of the present application the vocabularies corresponding to the subject, the predicate and the object are extracted. It should be noted that, since chat conversations are relatively casual, the search text does not necessarily include a subject, a predicate and an object at the same time. When extracting vocabularies, the vocabularies corresponding to the subject, the predicate and the object are all looked for, and the vocabulary corresponding to whichever of them is found is extracted.
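As an illustrative sketch of the selection in step S205, assume the dependency parser returns arcs of the form (dependent index, head index, relation label). The labels `SBV` (subject-verb) and `VOB` (verb-object) follow a common convention for Chinese dependency parsing and are an assumption here, as are the function name and the example arcs, which are hypothetical and are not the parse shown in fig. 7.

```python
from typing import Dict, List, Tuple

# (dependent index, head index, relation label); the labels are assumptions.
Arc = Tuple[int, int, str]

def select_spo(arcs: List[Arc]) -> Dict[str, List[int]]:
    """Return the word indices acting as subject, predicate and object."""
    roles: Dict[str, List[int]] = {"subject": [], "predicate": [], "object": []}
    for dependent, head, relation in arcs:
        if relation == "SBV":      # subject-verb: the dependent is a subject
            roles["subject"].append(dependent)
        elif relation == "VOB":    # verb-object: the dependent is an object
            roles["object"].append(dependent)
        else:
            continue               # other relations are ignored in this sketch
        if head not in roles["predicate"]:
            roles["predicate"].append(head)  # the governing verb is the predicate
    return {role: sorted(indices) for role, indices in roles.items()}

words = ["ask", "Tencent", "Meeting", "in", "whether", "consider",
         "add", "handwriting pad", "function"]
# Hypothetical arcs for illustration only.
arcs = [(1, 6, "SBV"), (2, 6, "SBV"), (7, 6, "VOB"), (4, 6, "ADV")]
roles = select_spo(arcs)
selected = {role: [words[i] for i in indices] for role, indices in roles.items()}
```

When the search text lacks one of the three components, the corresponding list simply stays empty, so whichever components are found are still returned.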
Optionally, in another embodiment of the present application, after step S205 is executed, the following may be further executed: if the vocabulary meeting the merging standard exists in the vocabulary corresponding to the selected subject, merging the vocabulary meeting the merging standard into one vocabulary.
Wherein the merging criteria are: there are multiple vocabularies corresponding to the subject, and their positions in the search text are consecutive.
This is because, when multiple vocabularies are in the subject position and their positions in the search text are consecutive, they usually form one complete name that was split into several vocabularies during word segmentation. In the above example, the two vocabularies "Tencent" and "meeting" should be merged into "Tencent Meeting". If they are not merged, the extracted vocabularies differ from the meaning originally expressed by the search text, and the user is provided with keywords that do not meet the user's needs.
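The merging criteria can be sketched as follows, assuming the subject vocabularies are identified by their positions in the segmented search text. The function name and the `joiner` parameter are hypothetical; for Chinese text the joiner would typically be an empty string.

```python
from typing import List

def merge_consecutive(words: List[str], indices: List[int], joiner: str = " ") -> List[str]:
    """Merge vocabularies whose positions in the search text are consecutive."""
    merged: List[str] = []
    run: List[int] = []
    for index in sorted(indices):
        if run and index != run[-1] + 1:
            # The run of consecutive positions ends here; emit it as one vocabulary.
            merged.append(joiner.join(words[i] for i in run))
            run = []
        run.append(index)
    if run:
        merged.append(joiner.join(words[i] for i in run))
    return merged

words = ["ask", "Tencent", "Meeting", "in", "whether", "consider",
         "add", "handwriting pad", "function"]
subject = merge_consecutive(words, [1, 2])
```

Vocabularies that are not adjacent are left as separate entries, and a single subject vocabulary is returned unchanged.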
In addition, since the search text is not necessarily a complete, fluent sentence that conforms to grammatical rules, it may contain meaningless predicates that contribute nothing to subsequent searches and can therefore be removed. Therefore, in another embodiment of the present application, after step S205 is performed, the method may further include: comparing the vocabulary corresponding to the predicate with preset target predicates, and, if a vocabulary corresponding to the predicate matches a target predicate, removing that vocabulary. Here, a target predicate is a preset meaningless predicate, such as "have, know, go, play". For example, for the search text "Beijing has sights", the predicate "has" obviously carries no meaning, while the extracted "Beijing" and "sights" are enough to meet the user's subsequent search needs, so "has" is removed.
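A minimal sketch of the predicate filtering described above. The set `TARGET_PREDICATES` contains the examples given in the text plus the inflected form "has", which is added here only so that the English gloss of the example works; the names are hypothetical.

```python
from typing import Iterable, List

# Preset meaningless predicates; "has" is an assumption added for the English example.
TARGET_PREDICATES = {"have", "has", "know", "go", "play"}

def remove_target_predicates(predicates: List[str],
                             targets: Iterable[str] = TARGET_PREDICATES) -> List[str]:
    """Drop every predicate vocabulary that matches a preset target predicate."""
    target_set = set(targets)
    return [p for p in predicates if p not in target_set]

# "Beijing has sights": the predicate "has" is removed, "Beijing" and "sights" remain.
kept = remove_target_predicates(["has"])
```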
S206, generating a keyword set by using the words corresponding to the subject, the predicate and the object.
Wherein, the keyword set includes: the vocabularies corresponding to the subject, the predicate and the object respectively, and a sentence composed of those vocabularies. That is, the keyword set includes not only each extracted vocabulary but also the complete sentence that the subject, the predicate and the object form together. For example, for the search text "May I ask whether adding a handwriting pad function to Tencent Meeting is being considered?", segmenting the search text yields: ask, Tencent, meeting, in, whether, consider, add, handwriting pad, function. With reference to the syntactic relations shown in fig. 7, the vocabulary corresponding to the subject is "Tencent Meeting", the vocabulary corresponding to the predicate is "add", and the vocabulary corresponding to the object is "handwriting pad". These vocabularies can form the sentence "Tencent Meeting adds handwriting pad", so the generated keyword set is: Tencent Meeting, add, handwriting pad, Tencent Meeting adds handwriting pad.
Obviously, the composed sentence is equivalent to a condensed version of the search text. After a user selects a search text and searches with it, if the search results are not accurate enough or not satisfactory, the user may want to refine the search. If the user searches with only one of the fed-back vocabularies, the results often differ greatly from the information the user originally wanted, so unexpected results are obtained. A sentence in the keyword set carries more content than a single vocabulary and is more concise than the search text, so it meets the user's need for further searching better than the individual vocabularies do.
It should be noted that, because chat conversations are relatively casual, the search text does not necessarily include a subject, a predicate and an object at the same time, so the selected vocabularies do not necessarily form a complete sentence, and the keyword set does not necessarily include a sentence composed of the selected vocabularies.
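The generation of the keyword set in step S206 can be sketched as follows, assuming the subject, predicate and object vocabularies have already been selected and merged. The function name is hypothetical, and the sentence is built by plain concatenation of English glosses, so it is not inflected the way a natural English sentence would be.

```python
from typing import List

def build_keyword_set(subject: List[str], predicate: List[str], obj: List[str],
                      joiner: str = " ") -> List[str]:
    """Collect the selected vocabularies and, when all three parts exist, their sentence."""
    keywords: List[str] = []
    for group in (subject, predicate, obj):
        for word in group:
            if word not in keywords:
                keywords.append(word)
    # A sentence is only added when subject, predicate and object are all present.
    if subject and predicate and obj:
        keywords.append(joiner.join(subject + predicate + obj))
    return keywords

keyword_set = build_keyword_set(["Tencent Meeting"], ["add"], ["handwriting pad"])
partial_set = build_keyword_set(["Beijing"], [], ["sights"])
```

In the second call no predicate remains, so the keyword set contains only the individual vocabularies and no composed sentence.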
Optionally, in order to provide a keyword set that matches the search habits of the user, in another embodiment of the present application, after step S206 is executed to obtain the keyword set, the method may further include: retrieving the user's historical search records, determining the number of historical searches for each vocabulary in the keyword set from those records, and removing from the keyword set the vocabularies whose number of historical searches is less than a preset number. After the vocabularies the user rarely searches for are removed, all the vocabularies and sentences remaining in the keyword set are fed back to the user, who can select among them for subsequent information searches.
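A minimal sketch of the history-based filtering, under two assumptions that the present application does not spell out: a historical query counts toward a vocabulary when the vocabulary occurs in the query text, and composed sentences are kept without being filtered. All names and the example history are hypothetical.

```python
from typing import List

def filter_by_history(vocabularies: List[str], sentences: List[str],
                      history: List[str], min_count: int) -> List[str]:
    """Keep vocabularies searched at least min_count times, then append the sentences."""
    kept = []
    for word in vocabularies:
        # Assumption: one historical query containing the vocabulary counts as one search.
        count = sum(1 for query in history if word in query)
        if count >= min_count:
            kept.append(word)
    return kept + sentences

# Hypothetical search history, for illustration only.
history = ["Tencent Meeting download", "Tencent Meeting handwriting pad", "weather"]
filtered = filter_by_history(
    ["Tencent Meeting", "add", "handwriting pad"],
    ["Tencent Meeting add handwriting pad"],
    history,
    min_count=1,
)
```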
In the keyword extraction method described above, a search text is obtained, and when the search text is determined to be of the question type, part-of-speech analysis is performed on each vocabulary of the search text to obtain the part of speech of each vocabulary. Based on the part of speech of each vocabulary, syntactic analysis is then performed on each vocabulary by using a dependency syntactic algorithm to obtain the syntactic relation between every two vocabularies with syntactic relation. Finally, vocabularies corresponding to the subject, the predicate and the object are selected from the vocabularies of the search text based on those syntactic relations, and a keyword set containing each selected vocabulary and the sentence composed of the selected vocabularies is generated. For question sentences, keywords are therefore no longer extracted with an entity dictionary, which makes keyword extraction more flexible and better suited to question sentences, so that keywords meeting the user's needs can be accurately extracted from them.
Another embodiment of the present application provides an apparatus for extracting a keyword, as shown in fig. 8, including the following units:
A first acquisition unit 801, configured to acquire a search text.
A judging unit 802, configured to judge whether the search text is of a question type.
A part-of-speech analysis unit 803, configured to, when the search text is of a question type, perform part-of-speech analysis on each vocabulary of the search text to obtain the part of speech of each vocabulary.
A syntax analysis unit 804, configured to perform syntactic analysis on each vocabulary by using a dependency syntactic algorithm based on the part of speech of each vocabulary, so as to obtain the syntactic relation between every two vocabularies having a syntactic relation.
An extracting unit 805, configured to select the vocabularies corresponding to the subject, the predicate, and the object from the vocabularies of the search text based on the syntactic relation between every two vocabularies having a syntactic relation.
A generating unit 806, configured to generate a keyword set using the vocabularies corresponding to the subject, the predicate, and the object.
Wherein, the keyword set includes: each selected vocabulary, and/or a sentence composed of the selected vocabularies.
Optionally, in an apparatus for extracting a keyword provided in another embodiment of the present application, the apparatus may further include:
A first word segmentation unit, configured to segment the search text to obtain each vocabulary of the search text.
A first feature processing unit, configured to perform feature processing on each vocabulary respectively to obtain a feature vector of each vocabulary.
A classification unit, configured to call a pre-trained convolutional neural network model to process the feature vector of each vocabulary and determine whether the search text is of a question type.
Optionally, in the apparatus for extracting a keyword provided in another embodiment of the present application, a model training unit is further included. The model training unit, as shown in fig. 9, includes the following units:
A second obtaining unit 901, configured to obtain a plurality of question titles and a plurality of news titles.
A sample unit 902, configured to use each question title and each news title as training sample data.
Each question title is used as positive training sample data, and each news title is used as negative training sample data.
A second word segmentation unit 903, configured to segment each training sample data to obtain a sample vocabulary corresponding to each training sample data.
A second feature processing unit 904, configured to perform feature processing on each sample vocabulary respectively to obtain a feature vector of each sample vocabulary.
An input unit 905, configured to input the feature vectors of the sample vocabularies corresponding to the training sample data into the convolutional neural network model, and perform calculation through the convolutional neural network model to obtain a classification result of the training sample data.
A parameter adjusting unit 906, configured to, when the error between the classification result of the training sample data and the label of the training sample data is greater than a preset threshold, perform parameter adjustment on the convolutional neural network model and return to the step of inputting the feature vectors of the sample vocabularies corresponding to the training sample data into the convolutional neural network model.
Wherein the label of the positive training sample data is 1, and the label of the negative training sample data is 0.
A first determining unit 907, configured to determine that training of the convolutional neural network model is completed when the error between the classification result of the training sample data and the label of the training sample data is not greater than the preset threshold.
Optionally, in an apparatus for extracting a keyword provided in another embodiment of the present application, the apparatus may further include:
A merging unit, configured to, if vocabularies meeting the merging criteria exist among the vocabularies corresponding to the selected subject, merge the vocabularies meeting the merging criteria into one vocabulary.
Wherein the merging criteria are: there are multiple vocabularies corresponding to the subject, and their positions in the search text are consecutive.
Optionally, in the apparatus for extracting a keyword provided in another embodiment of the present application, the apparatus may further include:
A comparison unit, configured to compare the vocabulary corresponding to the predicate with preset target predicates.
A first removing unit, configured to remove the vocabulary corresponding to the predicate that matches a target predicate.
Optionally, in the apparatus for extracting a keyword provided in another embodiment of the present application, the apparatus may further include:
A retrieval unit, configured to retrieve the historical search records of the user.
A second determining unit, configured to determine the number of historical searches for each vocabulary in the keyword set by using the historical search records.
A second removing unit, configured to remove from the keyword set the vocabularies whose number of historical searches is less than the preset number.
It should be noted that, for the specific working processes of the units provided in the foregoing embodiments of the present application, reference may be made to the corresponding steps in the foregoing method embodiments, which are not described here again.
A third aspect of the present application provides a computer storage medium for storing a computer program which, when executed, implements the keyword extraction method described in any one of the above embodiments.
Computer storage media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal or a carrier wave.
Another embodiment of the present application provides an electronic device, as shown in fig. 10, including:
a memory 1001 and a processor 1002.
The memory 1001 is used for storing a program, and the processor 1002 is used for executing the program stored in the memory 1001, and when the program is executed, the method for extracting a keyword provided in any of the above embodiments is specifically implemented.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A keyword extraction method is characterized by comprising the following steps:
acquiring a search text;
judging whether the search text is of a question type;
if the search text is judged to be the question type, performing part-of-speech analysis on each vocabulary of the search text to obtain the part-of-speech of each vocabulary;
performing syntactic analysis on each vocabulary by using a dependency syntactic algorithm to obtain a syntactic relation between every two vocabularies with syntactic relations;
selecting vocabularies corresponding to the subject, the predicate and the object from all the vocabularies of the search text based on the syntactic relation between every two vocabularies with syntactic relation;
generating a keyword set by using vocabularies corresponding to the subject, the predicate and the object respectively; wherein the set of keywords comprises: the vocabularies corresponding to the subject, the predicate and the object respectively, and a sentence composed of the vocabularies corresponding to the subject, the predicate and the object respectively.
2. The method of claim 1, wherein the determining whether the search text is of a question type comprises:
performing word segmentation on the search text to obtain each word of the search text;
respectively carrying out feature processing on each vocabulary to obtain a feature vector of each vocabulary;
and calling a pre-trained convolutional neural network model to process the feature vector of each vocabulary, and determining whether the search text is of a question type.
3. The method of claim 2, wherein the training method of the convolutional neural network model comprises:
acquiring a plurality of question titles and a plurality of news titles;
taking each question title and each news title as training sample data; wherein each question title is used as positive training sample data, and each news title is used as negative training sample data;
segmenting each training sample data to obtain a sample vocabulary corresponding to each training sample data;
respectively carrying out feature processing on each sample vocabulary to obtain a feature vector of each sample vocabulary;
inputting the feature vectors of the sample vocabularies corresponding to the training sample data into a convolutional neural network model, and calculating through the convolutional neural network model to obtain a classification result of the training sample data;
if the error between the classification result of the training sample data and the label of the training sample data is larger than a preset threshold value, performing parameter adjustment on the convolutional neural network model, and returning to execute the step of inputting the feature vector of each sample vocabulary corresponding to the training sample data into the convolutional neural network model; wherein the label of the positive training sample data is 1, and the label of the negative training sample data is 0;
and if the error between the classification result of the training sample data and the label of the training sample data is not greater than a preset threshold value, determining that the training of the convolutional neural network model is finished.
4. The method according to claim 1, wherein after selecting the vocabulary corresponding to the subject, the predicate and the object from the vocabularies of the search text, the method further comprises:
if vocabularies meeting the merging criteria exist among the vocabularies corresponding to the selected subject, merging the vocabularies meeting the merging criteria into one vocabulary; wherein the merging criteria are: there are multiple vocabularies corresponding to the subject, and their positions in the search text are consecutive.
5. The method according to claim 1, wherein after selecting the vocabulary corresponding to the subject, the predicate and the object from the vocabularies of the search text, the method further comprises:
comparing the vocabulary corresponding to the predicate with preset target predicate words;
and removing the vocabulary corresponding to the predicate matched with the target predicate.
6. The method according to claim 1, wherein after generating the keyword set using the vocabulary corresponding to the subject, the predicate, and the object, further comprising:
calling a historical search record of a user;
determining the historical search times of each vocabulary in the keyword set by using the historical search records;
and removing the vocabulary with the historical search times smaller than the preset times from the keyword set.
7. An extraction device of a keyword, characterized by comprising:
a first acquisition unit configured to acquire a search text;
a judging unit configured to judge whether the search text is of a question type;
a part-of-speech analysis unit configured to, when the search text is of a question type, perform part-of-speech analysis on each vocabulary of the search text to obtain the part of speech of each vocabulary;
a syntax analysis unit configured to perform syntactic analysis on each vocabulary by using a dependency syntactic algorithm to obtain the syntactic relation between every two vocabularies with syntactic relation;
an extraction unit configured to select vocabularies corresponding to the subject, the predicate and the object from the vocabularies of the search text based on the syntactic relation between every two vocabularies with syntactic relation;
a generating unit configured to generate a keyword set by using the vocabularies corresponding to the subject, the predicate and the object respectively; wherein the set of keywords comprises: the vocabularies corresponding to the subject, the predicate and the object respectively, and a sentence composed of the vocabularies corresponding to the subject, the predicate and the object respectively.
8. The apparatus according to claim 7, wherein the determining unit comprises:
a first word segmentation unit configured to segment the search text to obtain each vocabulary of the search text;
a first feature processing unit configured to perform feature processing on each vocabulary respectively to obtain a feature vector of each vocabulary;
and a classification unit configured to call a pre-trained convolutional neural network model to process the feature vector of each vocabulary and determine whether the search text is of a question type.
9. A computer storage medium storing a computer program which, when executed, implements the keyword extraction method according to any one of claims 1 to 6.
10. An electronic device, comprising:
a memory and a processor;
wherein the memory is used for storing programs;
the processor is configured to execute the program, and when the program is executed, the program is specifically configured to implement the keyword extraction method according to any one of claims 1 to 6.
CN202011342868.1A 2020-11-25 2020-11-25 Keyword extraction method and device, electronic equipment and computer storage medium Pending CN113392305A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011342868.1A CN113392305A (en) 2020-11-25 2020-11-25 Keyword extraction method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN113392305A true CN113392305A (en) 2021-09-14

Family

ID=77616590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011342868.1A Pending CN113392305A (en) 2020-11-25 2020-11-25 Keyword extraction method and device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113392305A (en)

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40051761; Country of ref document: HK)
SE01 Entry into force of request for substantive examination