CN110866396A - Method and device for determining main body of text designated information and computer storage medium - Google Patents
- Publication number
- CN110866396A (application number CN201911069210.5A)
- Authority
- CN
- China
- Prior art keywords
- clause
- sample
- vector
- real
- candidate subject
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
A method for determining the subject of specified information in a text includes: performing word segmentation on a target text; performing part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each segmented word; determining at least one candidate subject according to the part-of-speech tagging result of each segmented word; dividing the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject; obtaining a vector V of each sample and inputting the vector V into a pre-trained first neural network to determine whether a sample containing the specified information exists; and, when a sample containing the specified information exists, taking the candidate subject corresponding to that sample as the subject of the specified information. The method and device reduce manual labeling and lower cost.
Description
Technical Field
The present disclosure relates to computer technologies, and in particular, to a method and an apparatus for determining the subject of specified information in a text, and a storage medium.
Background
Determining the subject of negative information is a common task in online public-opinion monitoring. Given a text to be analyzed, the goal is to determine whether the text contains negative information and, if so, to output the name (or the position in the text) of the subject to which the negative information relates.
Existing statistical learning methods spend a great deal of effort on manual feature construction, which is time-consuming and labor-intensive, and can leave the model unable to generalize to new patterns beyond the hand-coded features.
Existing methods that use deep neural networks avoid the tedious process of manual feature construction by jointly learning subject identification and negativity judgment, but they require a large number of accurately sequence-labeled samples. With sequence labeling, every character of the text to be analyzed must be manually labeled in the labeling stage. For example, as in Fig. 2, the label sequence "B I I I I I I I I I I I I I I O O O O O O O O O O O O O O O O O O O O O O O O O O O O" corresponding to "Guangzhou Development Zone Financial Holdings Group Co., Ltd. thought of a name with distinct Guangzhou characteristics for it" is exactly as long as the character string of the input text.
Disclosure of Invention
The present application provides a method, an apparatus, and a storage medium for determining the subject of specified information in a text, which can reduce manual labeling and lower cost.
The present application provides a method for determining the subject of specified information in a text, which includes the following steps: performing word segmentation on a target text; performing part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each segmented word; determining at least one candidate subject according to the part-of-speech tagging result of each segmented word; dividing the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject; obtaining a vector V of each sample and inputting the vector V into a pre-trained first neural network to determine whether a sample containing the specified information exists; and, when a sample containing the specified information exists, taking the candidate subject corresponding to that sample as the subject of the specified information.
In an exemplary embodiment, the obtaining of the vector V of each sample includes performing the following operations on each obtained sample: splitting the sample at the position of its candidate subject to obtain a first clause A and a second clause B, where the first clause A spans from the start of the sample to the position where the candidate subject begins, and the second clause B spans from the position where the candidate subject begins to the end of the sample; vectorizing each segmented word of the target text in the first clause A and the second clause B to obtain a real-valued matrix MA of the first clause A and a real-valued matrix MB of the second clause B; and inputting the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into a second neural network to encode the first clause A and the second clause B and obtain the vector V of the sample.
In an exemplary embodiment, the inputting of the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into the second neural network to encode the first clause A and the second clause B and obtain the vector V of the sample includes: inputting the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into a pre-trained second neural network and encoding the first clause A and the second clause B to obtain an encoding vector VA of the first clause A and an encoding vector VB of the second clause B; and concatenating the obtained vectors VA and VB to obtain the vector V of the sample.
In an exemplary embodiment, the encoding of the first clause A and the second clause B includes: encoding the first clause A from front to back and encoding the second clause B from back to front.
In an exemplary embodiment, the method further includes: collecting the subjects corresponding to the samples containing the specified information, merging them, and outputting the result.
In an exemplary embodiment, the determining of at least one candidate subject according to the part-of-speech tagging result of each segmented word includes: when the part-of-speech tagging result of a segmented word is a proper noun, or a phrase formed with the proper noun, determining that segmented word or phrase as a candidate subject.
The present application also provides an apparatus for determining a subject of specified information in a text, including: a part-of-speech tagging module, configured to perform word segmentation on a target text and perform part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each segmented word; a determining module, configured to determine at least one candidate subject according to the part-of-speech tagging result of each segmented word; a sample dividing module, configured to divide the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject; and a vector acquisition and analysis module, configured to obtain a vector V of each sample and input the vector V into a pre-trained first neural network to determine whether a sample containing the specified information exists; when a sample containing the specified information exists, the candidate subject corresponding to that sample is the subject of the specified information.
In an exemplary embodiment, the vector acquisition and analysis module obtains the vector V of each sample by performing the following operations on each obtained sample: splitting the sample at the position of its candidate subject to obtain a first clause A and a second clause B, where the first clause A spans from the start of the sample to the position where the candidate subject begins, and the second clause B spans from the position where the candidate subject begins to the end of the sample; vectorizing each segmented word of the target text in the first clause A and the second clause B to obtain a real-valued matrix MA of the first clause A and a real-valued matrix MB of the second clause B; and inputting the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into a second neural network to encode the first clause A and the second clause B and obtain the vector V of the sample.
The present application also provides a device for targeted content delivery, including a processor and a memory, the memory storing a program for targeted content delivery; the processor is configured to read the program for targeted content delivery and execute any one of the methods above.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the above.
Compared with the related art, the candidate subjects can be obtained simply by performing word segmentation and part-of-speech tagging on the target text, without manually coded features, which saves labor cost; at the same time, provided the model is trained with a sufficient amount of data, it generalizes better.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification, claims, and drawings.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
Fig. 1 is a flowchart of a method for determining a main body of text-specific information according to an embodiment of the present application;
FIG. 2 is a diagram illustrating an example of determining a body of text-specific information according to an embodiment of the present application;
fig. 3 is a block diagram of a body determination module of text-specific information according to an embodiment of the present application.
Detailed Description
The present application describes embodiments, but the description is illustrative rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
The technical solutions of the present application will be described in more detail below with reference to the accompanying drawings and embodiments.
As shown in fig. 1, an embodiment of the present invention provides a method for determining a main body of text-specific information, including the following steps:
S1, performing word segmentation on the target text, and performing part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each segmented word;
S2, determining at least one candidate subject according to the part-of-speech tagging result of each segmented word;
S3, dividing the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject;
S4, obtaining a vector V of each sample, and inputting the vector V into a pre-trained first neural network to determine whether a sample containing the specified information exists; when a sample containing the specified information exists, the candidate subject corresponding to that sample is the subject of the specified information.
In one exemplary embodiment, the first neural network is a feed-forward neural network.
Word segmentation refers to the process of recombining a continuous character sequence into a sequence of words according to a given specification. Part-of-speech (POS) tagging, also known as grammatical tagging or word-category disambiguation, is a text-processing technique in corpus linguistics that labels the part of speech of each word in a corpus according to its meaning and context.
As shown in FIG. 2, the sentence "Guangzhou Development Zone Financial Holdings Group Co., Ltd. thought of a name with distinct Guangzhou characteristics for it: Knowledge City Securities." is used as the target text, and word segmentation and part-of-speech tagging are performed; the result is shown in FIG. 2, where NR denotes proper nouns, NN denotes other nouns, JJ denotes adjectives or ordinal words, PN denotes pronouns, VV denotes verbs, and so on. These are common abbreviations in computer part-of-speech tagging and are not described further here. In this embodiment, the Stanford CoreNLP tagging system is used for part-of-speech tagging.
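Illustratively, a minimal Python sketch of this segmentation and tagging step is given below. It assumes the third-party stanfordcorenlp wrapper around a locally installed CoreNLP distribution; the installation path is a placeholder, and the exact tags returned depend on the tagger model.

```python
# Minimal sketch: Chinese word segmentation + POS tagging with Stanford CoreNLP.
# Assumes the third-party "stanfordcorenlp" Python wrapper and a local CoreNLP
# distribution; the path below is a placeholder.
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP(r'/path/to/stanford-corenlp-full', lang='zh')

target_text = "广州开发区金融控股集团有限公司为它想了一个颇有广州特色的名字：知识城证券。"

# pos_tag returns a list of (word, tag) pairs, e.g. [('广州', 'NR'), ('开发区', 'NN'), ...]
tagged = nlp.pos_tag(target_text)
print(tagged)

nlp.close()
```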
In an exemplary embodiment, in step S2, the determining of at least one candidate subject according to the part-of-speech tagging result of each segmented word includes: when the part-of-speech tagging result of a segmented word is a proper noun, or a phrase formed with the proper noun, determining that segmented word or phrase as a candidate subject.
Illustratively, according to the part-of-speech tagging result, each proper noun (NR), together with any phrase it forms with adjacent other nouns (NN) (combinations including but not limited to NR+NN, NN+NR, and the like), is taken as a candidate subject of negative information. For example, as shown in FIG. 2, according to the tagging result, "Knowledge City Securities", "Guangzhou characteristics" and "Guangzhou Development Zone Financial Holdings Group" are taken as candidate subjects, and therefore three samples are divided. In this embodiment, each sample has the same content, namely "Guangzhou Development Zone Financial Holdings Group Co., Ltd. thought of a name with distinct Guangzhou characteristics for it: Knowledge City Securities.".
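Illustratively, the candidate-extraction rule above can be sketched in Python as follows. The function name, the rule that a run of nouns is kept only if it contains at least one NR, and the (word, tag) pairs are illustrative assumptions rather than the exact procedure of FIG. 2.

```python
# Sketch: derive candidate subjects from (word, tag) pairs by merging adjacent
# proper nouns (NR) and other nouns (NN). A run of noun tags is kept as a
# candidate only if it contains at least one NR -- an assumption that matches
# the NR+NN / NN+NR combinations mentioned above.
def extract_candidates(tagged):
    candidates, run = [], []
    for word, tag in tagged + [('', 'EOS')]:   # sentinel flushes the last run
        if tag in ('NR', 'NN'):
            run.append((word, tag))
        else:
            if run and any(t == 'NR' for _, t in run):
                candidates.append(''.join(w for w, _ in run))
            run = []
    return candidates

# Illustrative (word, tag) pairs in the style of FIG. 2:
tagged = [('广州', 'NR'), ('开发区', 'NN'), ('金融', 'NN'), ('控股', 'NN'),
          ('集团', 'NN'), ('有限公司', 'NN'), ('为', 'P'), ('它', 'PN'),
          ('想', 'VV'), ('了', 'AS'), ('一个', 'CD'), ('颇', 'AD'),
          ('有', 'VE'), ('广州', 'NR'), ('特色', 'NN'), ('的', 'DEG'),
          ('名字', 'NN'), ('：', 'PU'), ('知识城', 'NR'), ('证券', 'NN'), ('。', 'PU')]
print(extract_candidates(tagged))
# three candidate subjects for the example sentence
```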
In an exemplary embodiment, in step S4, the obtaining of the vector V of each sample includes performing the following operations on each obtained sample (a code sketch of the splitting step is given after step S43):
S41, splitting the sample at the position of its candidate subject to obtain a first clause A and a second clause B, where the first clause A spans from the start of the sample to the position where the candidate subject begins, and the second clause B spans from the position where the candidate subject begins to the end of the sample;
S42, vectorizing each segmented word of the target text in the first clause A and the second clause B to obtain a real-valued matrix MA of the first clause A and a real-valued matrix MB of the second clause B;
S43, inputting the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into a second neural network to encode the first clause A and the second clause B and obtain the vector V of the sample.
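Illustratively, step S41 can be sketched as follows; the function name and the character-level splitting are assumptions for illustration.

```python
# Sketch of step S41: split a sample at the starting position of its candidate
# subject. Clause A runs from the start of the sample to where the candidate
# subject begins; clause B runs from the candidate subject to the end.
def split_sample(sample: str, candidate: str):
    start = sample.find(candidate)
    if start < 0:
        raise ValueError("candidate subject not found in sample")
    return sample[:start], sample[start:]

sample = "广州开发区金融控股集团有限公司为它想了一个颇有广州特色的名字：知识城证券。"
clause_a, clause_b = split_sample(sample, "广州特色")
# clause_a == "广州开发区金融控股集团有限公司为它想了一个颇有"
# clause_b == "广州特色的名字：知识城证券。"
```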
In one exemplary embodiment, the encoding of the first clause A and the second clause B in step S43 includes:
S431, encoding the first clause A and the second clause B to obtain an encoding vector VA of the first clause A and an encoding vector VB of the second clause B;
S432, concatenating the obtained vectors VA and VB to obtain the vector V of the sample.
In an exemplary embodiment, in step S431, the inputting of the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into a pre-trained second neural network and the encoding of the first clause A and the second clause B further include:
inputting the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into the pre-trained second neural network, encoding the first clause A from front to back, and encoding the second clause B from back to front.
Illustratively, the second neural network is a recurrent neural network, including but not limited to RNN, GRU, LSTM, and the like.
As shown in FIG. 2, for the sentence "Guangzhou Development Zone Financial Holdings Group Co., Ltd. thought of a name with distinct Guangzhou characteristics for it: Knowledge City Securities.", take "Guangzhou characteristics" as an example candidate subject: the corresponding clause A is the portion of the sentence from its start up to the candidate subject, and clause B is the portion from the candidate subject ("Guangzhou characteristics") to the end of the sentence. Then, by looking up the pre-trained word vector of each word of the target text and using the real-valued vector as the representation of the word or phrase, the real-valued matrix representations of clauses A and B in each sample are obtained and recorded as matrices MA and MB. Since a candidate subject may be formed by combining several words, when no word vector is found for the phrase, the average of the word vectors of the words contained in the phrase is used instead.
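Illustratively, the construction of the real-valued matrices MA and MB, including the fallback of averaging word vectors for a phrase that has no vector of its own, can be sketched as follows; the embedding table, its dimension, and the helper names are placeholders.

```python
import numpy as np

def token_vector(token, embeddings, dim=300):
    """token is a single word (str) or a candidate phrase (list of words)."""
    if isinstance(token, str):
        return embeddings.get(token, np.zeros(dim))
    # candidate phrase with no vector of its own: average its component words
    vecs = [embeddings[w] for w in token if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def clause_matrix(tokens, embeddings, dim=300):
    """Stack one row per token to form the real-valued matrix of a clause."""
    return np.stack([token_vector(t, embeddings, dim) for t in tokens])

# Example with hypothetical 4-dimensional embeddings:
emb = {"广州": np.ones(4), "特色": np.zeros(4)}
M_B = clause_matrix([["广州", "特色"], "的", "名字"], emb, dim=4)
print(M_B.shape)   # (3, 4)
```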
The obtained MA and MB are input into a recurrent neural network, which encodes the first clause A from front to back and the second clause B from back to front; the resulting encodings are combined through an attention mechanism and mapped into a new semantic space, so that long-distance dependencies in the sentence can be captured. The encoding vectors of the first clause A and the second clause B are thereby obtained as VA and VB.
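Illustratively, a minimal PyTorch sketch of one possible second neural network is given below: a shared GRU encodes clause A front to back and clause B back to front (by reversing its rows), and a simple learned attention pools each output sequence. The hidden size, the form of the attention, and the module names are assumptions, since the application does not fix them.

```python
import torch
import torch.nn as nn

class ClausePairEncoder(nn.Module):
    """Sketch of the second neural network: encode clause A front-to-back and
    clause B back-to-front with a shared GRU, then pool each output sequence
    with a learned attention. Sizes are illustrative."""
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.att = nn.Linear(hidden, 1)

    def attend(self, states):                               # states: (1, T, H)
        weights = torch.softmax(self.att(states), dim=1)    # (1, T, 1)
        return (weights * states).sum(dim=1)                # (1, H)

    def forward(self, m_a, m_b):                            # (1, Ta, E), (1, Tb, E)
        out_a, _ = self.rnn(m_a)                            # front to back
        out_b, _ = self.rnn(torch.flip(m_b, dims=[1]))      # back to front
        return self.attend(out_a), self.attend(out_b)       # VA, VB

encoder = ClausePairEncoder()
m_a = torch.randn(1, 12, 300)   # real-valued matrix MA of clause A
m_b = torch.randn(1, 7, 300)    # real-valued matrix MB of clause B
v_a, v_b = encoder(m_a, m_b)
```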
The obtained feature vectors VA and VB are concatenated to obtain the vector representation V of the entire sample. V is input into the feed-forward neural network, with Softmax as the activation function of the output layer. The output layer outputs three real values, corresponding respectively to: label 1 (there is negative information involving the entity); label -1 (there is no negative information, or the entity is not involved); and label 0 (the candidate phrase does not constitute an entity). By comparing the three real values, the label corresponding to the largest value is selected as the final judgment result.
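Illustratively, a companion sketch of the first neural network is given below: a feed-forward network over the concatenated vector V with a Softmax over three outputs mapped to the labels 1, -1 and 0. The layer sizes and the label ordering are assumptions; in training, the Softmax would typically be folded into a cross-entropy loss and is kept explicit here only to mirror the description.

```python
import torch
import torch.nn as nn

LABELS = [1, -1, 0]   # ordering is an assumption: negative info / none / not an entity

class SubjectClassifier(nn.Module):
    """Sketch of the first neural network: a feed-forward net over the
    concatenated sample vector V = [VA ; VB], with Softmax over three labels."""
    def __init__(self, hidden=128, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * hidden, width),   # V is the concatenation of VA and VB
            nn.ReLU(),
            nn.Linear(width, 3),
        )

    def forward(self, v_a, v_b):
        v = torch.cat([v_a, v_b], dim=-1)
        return torch.softmax(self.net(v), dim=-1)

clf = SubjectClassifier()
# In practice v_a, v_b come from the clause encoder above; random here.
probs = clf(torch.randn(1, 128), torch.randn(1, 128))
label = LABELS[int(probs.argmax(dim=-1))]
```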
For example, for the sentence "Guangzhou Development Zone Financial Holdings Group Co., Ltd. thought of a name with distinct Guangzhou characteristics for it: Knowledge City Securities.", with "Guangzhou characteristics" as the candidate subject this step should output label 0, while with "Knowledge City Securities" as the candidate subject it should output label -1.
In an exemplary embodiment, the method for determining the subject of specified information in a text further includes the step of: S5, collecting the subjects corresponding to the samples containing the specified information, merging them, and outputting the result.
Illustratively, the samples obtained by splitting each text to be analyzed are aggregated, and if two or more subjects associated with negative information exist in the target text, the two or more subjects are output together as the result.
For example, for "Guangzhou Development Zone Financial Holdings Group Co., Ltd. thought of a name with distinct Guangzhou characteristics for it: Knowledge City Securities.", the result output by this step is: {text: "Guangzhou Development Zone Financial Holdings Group Co., Ltd. thought of a name with distinct Guangzhou characteristics for it: Knowledge City Securities.", label: -1, entity: "Guangzhou Development Zone Financial Holdings Group Co., Ltd.|Knowledge City Securities", negative_entity: ""}.
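Illustratively, step S5 can be sketched as follows, assuming the per-candidate labels produced above and the output fields of the example record; the helper name is hypothetical.

```python
# Sketch of step S5: merge the per-candidate judgments for one target text into
# a single output record. "judgments" pairs each candidate subject with the
# label produced by the classifier; field names follow the example record above.
def merge_results(text, judgments):
    entities = [subj for subj, label in judgments if label != 0]
    negative = [subj for subj, label in judgments if label == 1]
    return {
        "text": text,
        "label": 1 if negative else -1,
        "entity": "|".join(entities),
        "negative_entity": "|".join(negative),
    }

judgments = [("广州开发区金融控股集团有限公司", -1),
             ("广州特色", 0),
             ("知识城证券", -1)]
print(merge_results("广州开发区金融控股集团有限公司为它想了一个颇有广州特色的名字：知识城证券。", judgments))
# label: -1, entity: "广州开发区金融控股集团有限公司|知识城证券", negative_entity: ""
```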
As shown in fig. 3, an embodiment of the present invention further provides a device for determining a main body of text-specific information, including the following modules:
a part-of-speech tagging module 10, configured to perform word segmentation on a target text and perform part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each segmented word;
a determining module 20, configured to determine at least one candidate subject according to the part-of-speech tagging result of each segmented word;
a sample dividing module 30, configured to divide the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject;
a vector acquisition and analysis module 40, configured to obtain a vector V of each sample and input the vector V into a pre-trained first neural network to determine whether a sample containing the specified information exists; when a sample containing the specified information exists, the candidate subject corresponding to that sample is the subject of the specified information.
The vector acquisition and analysis module 40 obtains the vector V of each sample by performing the following operations on each obtained sample:
the vector acquisition and analysis module 40 splits the sample at the position of its candidate subject to obtain a first clause A and a second clause B, where the first clause A spans from the start of the sample to the position where the candidate subject begins, and the second clause B spans from the position where the candidate subject begins to the end of the sample;
the vector acquisition and analysis module 40 vectorizes each segmented word of the target text in the first clause A and the second clause B to obtain a real-valued matrix MA of the first clause A and a real-valued matrix MB of the second clause B;
the vector acquisition and analysis module 40 inputs the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into a second neural network to encode the first clause A and the second clause B and obtain the vector V of the sample.
The invention also provides a device for targeted content delivery, including a processor and a memory, the memory storing a program for targeted content delivery; the processor is configured to read the program for targeted content delivery and execute any one of the methods above.
The invention also provides a computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, is adapted to perform the method of any of the above.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
Claims (10)
1. A method for determining a subject of text-specific information, comprising:
performing word segmentation on a target text, and performing part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each segmented word;
determining at least one candidate subject according to the part-of-speech tagging result of each segmented word;
dividing the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject;
obtaining a vector V of each sample, and inputting the vector V into a pre-trained first neural network to determine whether a sample containing the specified information exists;
wherein, when a sample containing the specified information exists, the candidate subject corresponding to that sample is the subject of the specified information.
2. The method of claim 1, wherein the obtaining of the vector V of each sample comprises performing the following operations on each obtained sample:
splitting the sample at the position of its candidate subject to obtain a first clause A and a second clause B, wherein the first clause A spans from the start of the sample to the position where the candidate subject begins, and the second clause B spans from the position where the candidate subject begins to the end of the sample;
vectorizing each segmented word of the target text in the first clause A and the second clause B to obtain a real-valued matrix MA of the first clause A and a real-valued matrix MB of the second clause B; and
inputting the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into a second neural network to encode the first clause A and the second clause B and obtain the vector V of the sample.
3. The method of claim 2, wherein the inputting of the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into the second neural network to encode the first clause A and the second clause B and obtain the vector V of the sample comprises:
inputting the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into a pre-trained second neural network, and encoding the first clause A and the second clause B to obtain an encoding vector VA of the first clause A and an encoding vector VB of the second clause B; and
concatenating the obtained vectors VA and VB to obtain the vector V of the sample.
4. The method of claim 3, wherein said encoding the first clause A and the second clause B comprises:
the first clause a is coded from front to back and the second clause B is coded from back to front.
5. The method of claim 1, wherein the method further comprises: collecting the subjects corresponding to the samples containing the specified information, and merging and outputting the subjects.
6. The method of claim 1, wherein the determining of at least one candidate subject according to the part-of-speech tagging result of each segmented word comprises:
when the part-of-speech tagging result of a segmented word is a proper noun, or a phrase formed with the proper noun, determining the segmented word or phrase as a candidate subject.
7. An apparatus for determining a subject of specified information in a text, comprising:
a part-of-speech tagging module, configured to perform word segmentation on a target text and perform part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each segmented word;
a determining module, configured to determine at least one candidate subject according to the part-of-speech tagging result of each segmented word;
a sample dividing module, configured to divide the target text according to each determined candidate subject to obtain a sample corresponding to each candidate subject; and
a vector acquisition and analysis module, configured to obtain a vector V of each sample and input the vector V into a pre-trained first neural network to determine whether a sample containing the specified information exists; wherein, when a sample containing the specified information exists, the candidate subject corresponding to that sample is the subject of the specified information.
8. The apparatus of claim 7, wherein the vector acquisition and analysis module is configured to obtain the vector V of each sample by performing the following operations on each obtained sample:
splitting the sample at the position of its candidate subject to obtain a first clause A and a second clause B, wherein the first clause A spans from the start of the sample to the position where the candidate subject begins, and the second clause B spans from the position where the candidate subject begins to the end of the sample;
vectorizing each segmented word of the target text in the first clause A and the second clause B to obtain a real-valued matrix MA of the first clause A and a real-valued matrix MB of the second clause B; and
inputting the real-valued matrix MA of the first clause A and the real-valued matrix MB of the second clause B into a second neural network to encode the first clause A and the second clause B and obtain the vector V of the sample.
9. An apparatus for targeted delivery of content, comprising a processor and a memory, wherein the memory has stored therein a program for targeted delivery of content; the processor is used for reading the program for targeted delivery and executing the method of any one of claims 1-6.
10. A computer storage medium on which a computer program is stored, which computer program, when being executed by a processor, carries out the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911069210.5A CN110866396B (en) | 2019-11-05 | 2019-11-05 | Method and device for determining main body of text specified information and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911069210.5A CN110866396B (en) | 2019-11-05 | 2019-11-05 | Method and device for determining main body of text specified information and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110866396A true CN110866396A (en) | 2020-03-06 |
CN110866396B CN110866396B (en) | 2023-05-09 |
Family
ID=69654321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911069210.5A Active CN110866396B (en) | 2019-11-05 | 2019-11-05 | Method and device for determining main body of text specified information and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110866396B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109325121A (en) * | 2018-09-14 | 2019-02-12 | 北京字节跳动网络技术有限公司 | Method and apparatus for determining the keyword of text |
CN110162749A (en) * | 2018-10-22 | 2019-08-23 | 哈尔滨工业大学(深圳) | Information extracting method, device, computer equipment and computer readable storage medium |
US20190278843A1 (en) * | 2017-02-27 | 2019-09-12 | Tencent Technology (Shenzhen) Company Ltd | Text entity extraction method, apparatus, and device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110866396B (en) | 2023-05-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||