CN111859862A - Text data labeling method and device, storage medium and electronic device - Google Patents

Info

Publication number
CN111859862A
Authority
CN
China
Prior art keywords
data
labeling
text
category
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010712345.5A
Other languages
Chinese (zh)
Other versions
CN111859862B (en
Inventor
韩俊明
赵培
马志芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haier Uplus Intelligent Technology Beijing Co Ltd filed Critical Haier Uplus Intelligent Technology Beijing Co Ltd
Priority to CN202010712345.5A priority Critical patent/CN111859862B/en
Publication of CN111859862A publication Critical patent/CN111859862A/en
Application granted granted Critical
Publication of CN111859862B publication Critical patent/CN111859862B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/137Hierarchical processing, e.g. outlines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text data labeling method and device, a storage medium and an electronic device. The method comprises: acquiring a text to be labeled; labeling the text in a first, hierarchical layer-by-layer serial processing mode to obtain first labeled data, and labeling the text in a second, hierarchical parallel processing mode to obtain second labeled data; labeling the part of the first labeled data that differs from the second labeled data according to a preset rule to obtain third labeled data, and labeling the part of the first labeled data that is the same as the second labeled data to obtain fourth labeled data; and determining the third labeled data and the fourth labeled data as the labeled data of the text. By combining the two labeling modes, comparing their outputs, and applying secondary processing to the data on which they differ, the method addresses the technical problem of low accuracy of text data labeling in the prior art.

Description

Text data labeling method and device, storage medium and electronic device
Technical Field
The invention relates to the field of data processing, in particular to a text data labeling method and device, a storage medium and an electronic device.
Background
Natural language processing requires a large amount of labeled data. Labeled data with an accuracy above 90% can generally be used by a model, but in some settings, such as the household appliance industry, the stability of the model must be ensured, and the existing data must be 100% accurate. Manually labeled data, however, still has an error rate of about 10%. Correcting these labeling errors requires further human and material resources for proofreading and correction, followed by another round of review and labeling.
In the prior art, traditional language processing algorithms are used to verify and analyze natural language annotations.
In the serial, layer-by-layer approach, the complete natural-language input is parsed in a logical order from broad to precise. One obvious drawback of this type of solution is the accumulation of errors: an error produced by an upper layer is not caught in time but is passed to the next layer as input, so the lower layer inherits the erroneous result. This causes a large amount of unnecessary detection and recognition work and wastes resources.
In the parallel approach, where each layer is processed separately, each layer has its own recognition unit and criteria, and recognition at one layer does not affect the others, which effectively solves the problem of error propagation. However, a recognition method detached from inter-level associations destroys the strong logical structure of natural language: the same sentence may be parsed with analysis methods from different fields, and the analysis result may be unsatisfactory.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for data annotation of a text, a storage medium and an electronic device, and at least solves the technical problem that in the prior art, the accuracy of data annotation of the text is low.
According to an aspect of the embodiments of the present invention, there is provided a method for annotating data of a text, including: acquiring a text to be labeled, wherein the text at least comprises a target object to be labeled; labeling the text with data in a first hierarchical serial processing mode to obtain first labeled data, and labeling the text with data in a second hierarchical parallel processing mode to obtain second labeled data; labeling the part of the first labeling data, which is different from the second labeling data, according to a preset rule to obtain third labeling data, and labeling the part of the first labeling data, which is the same as the second labeling data, to obtain fourth labeling data; and determining the third annotation data and the fourth annotation data as the annotation data of the text.
According to another aspect of the embodiments of the present invention, there is also provided a text data annotation device, including: an acquisition unit, used for acquiring a text to be labeled, where the text includes at least one target object to be labeled; a first labeling unit, used for labeling the text with data in a first hierarchical layer-by-layer serial processing mode to obtain first labeled data, and labeling the text with data in a second hierarchical parallel processing mode to obtain second labeled data; a second labeling unit, used for labeling the part of the first labeled data that differs from the second labeled data according to a preset rule to obtain third labeled data, and labeling the part of the first labeled data that is the same as the second labeled data to obtain fourth labeled data; and a determining unit, configured to determine the third labeled data and the fourth labeled data as the labeled data of the text, where the fourth labeled data is the labeled data of the part in which the first labeled data and the second labeled data are the same.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, where the computer program is configured to execute the data annotation method of the text when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method for annotating text data through the computer program.
In the embodiment of the invention, a text to be labeled is obtained, wherein the text at least comprises a target object to be labeled; labeling data of the text by a first hierarchical serial processing mode to obtain first labeled data, and labeling data of the text by a second hierarchical parallel processing mode to obtain second labeled data; labeling the part of the first labeling data, which is different from the second labeling data, according to a preset rule to obtain third labeling data, and labeling the part of the first labeling data, which is the same as the second labeling data, to obtain fourth labeling data; the third labeling data and the fourth labeling data are determined as the labeling data of the text, so that the purposes of combining two labeling data modes, comparing the data with the difference generated by the two modes and then performing secondary processing are achieved, the technical effect of improving the accuracy of the text labeling data is achieved, and the technical problem that in the prior art, the accuracy of data labeling on the text is low is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a diagram illustrating an application environment of an alternative text data annotation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative method for annotating data with text in accordance with an embodiment of the present invention;
FIG. 3 is a flow diagram of an alternative first processing of text according to an embodiment of the invention;
FIG. 4 is a flowchart of an alternative second text processing method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an alternative text semantic hierarchy according to an embodiment of the present invention;
FIG. 6 is a flowchart of an alternative method for verifying labeled data based on a multi-level multi-model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an alternative text data annotation device according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device for implementing an alternative text data annotation method according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, a text data annotation method is provided. As an optional implementation, the method may be applied to, but is not limited to, a text data annotation system in the hardware environment shown in FIG. 1, where the system may include, but is not limited to, the terminal device 102, the network 110, and the server 112.
The terminal device 102 may include, but is not limited to, a human-computer interaction screen 104, a processor 106 and a memory 108. The human-computer interaction screen 104 is used for acquiring a human-computer interaction instruction through a human-computer interaction interface and for presenting the text to be labeled; the processor 106 is configured to label the text with data in response to the human-computer interaction instruction; the memory 108 is used for storing the text to be labeled and the labeled data obtained after labeling. The server 112 may include, but is not limited to, a database 114 and a processing engine 116, where the processing engine 116 is used for retrieving the text to be labeled stored in the database 114, the text including at least one target object to be labeled; labeling the text with data in a first hierarchical layer-by-layer serial processing mode to obtain first labeled data, and labeling the text with data in a second hierarchical parallel processing mode to obtain second labeled data; labeling the part of the first labeled data that differs from the second labeled data according to a preset rule to obtain third labeled data, and labeling the part of the first labeled data that is the same as the second labeled data to obtain fourth labeled data; and determining the third labeled data and the fourth labeled data as the labeled data of the text. In this way, the two labeling modes are combined, the data on which they differ is identified by comparison and then given secondary processing, which achieves the technical effect of improving the accuracy of text data labeling and solves the technical problem of low accuracy of text data labeling in the prior art.
The specific process comprises the following steps: a human-computer interaction screen 104 in the terminal device 102 displays a text to be annotated (shown in fig. 1 as a target object (person a) included in the text). In steps S102-S110, the text to be annotated is obtained, and the text is sent to the server 112 through the network 110. Labeling data of the text in a server 112 by a first hierarchical serial processing mode layer by layer to obtain first labeled data, and labeling data of the text by a second hierarchical parallel processing mode to obtain second labeled data; labeling the part of the first labeling data, which is different from the second labeling data, according to a preset rule to obtain third labeling data, and labeling the part of the first labeling data, which is the same as the second labeling data, to obtain fourth labeling data; and determining the third annotation data and the fourth annotation data as the annotation data of the text. And then returns the determined result to the terminal device 102.
Then, in step S102-S110, the terminal device 102 labels the text with data in a first hierarchical, layer-by-layer, serial processing manner to obtain first labeled data, and labels the text with data in a second hierarchical, parallel processing manner to obtain second labeled data; labeling the part of the first labeling data, which is different from the second labeling data, according to a preset rule to obtain third labeling data, and labeling the part of the first labeling data, which is the same as the second labeling data, to obtain fourth labeling data; and determining the third annotation data and the fourth annotation data as the annotation data of the text.
Optionally, in this embodiment, the text data annotation method can be applied, but is not limited to being applied, in the server 112, to assist the application client in labeling the text to be labeled. The application client may run on, but is not limited to, the terminal device 102, and the terminal device 102 may be, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a PC, or another terminal device that supports running the application client. The server 112 and the terminal device 102 may exchange data through a network, which may include, but is not limited to, a wireless network or a wired network. The wireless network includes Bluetooth, Wi-Fi, and other networks that enable wireless communication. The wired network may include, but is not limited to, wide area networks, metropolitan area networks, and local area networks. The above is merely an example, and this embodiment is not limited thereto.
Optionally, as an optional implementation manner, as shown in fig. 2, the text data annotation method includes:
Step S202, a text to be labeled is obtained, wherein the text at least comprises a target object to be labeled.
Step S204, the text is labeled with data in a first hierarchical layer-by-layer serial processing mode to obtain first labeled data, and labeled with data in a second hierarchical parallel processing mode to obtain second labeled data.
Step S206, the part of the first labeled data that differs from the second labeled data is labeled according to a preset rule to obtain third labeled data, and the part of the first labeled data that is the same as the second labeled data is labeled to obtain fourth labeled data.
Step S208, the third labeled data and the fourth labeled data are determined as the labeled data of the text.
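Steps S202 to S208 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the two labelers are stub functions that return a dictionary keyed by level (`category`, `field`, `intention`), and `rule` is a hypothetical callable standing in for the preset rule, whose concrete form this passage does not specify.

```python
from typing import Callable, Dict, Optional

Labels = Dict[str, str]  # level name -> label
Labeler = Callable[[str], Labels]
# Hypothetical signature: (text, level, first_label, second_label) -> resolved label
PresetRule = Callable[[str, str, Optional[str], Optional[str]], str]


def annotate(text: str, serial: Labeler, parallel: Labeler, rule: PresetRule) -> Labels:
    """Combine two labelings of the same text (steps S204 to S208)."""
    first = serial(text)     # S204: first labeled data
    second = parallel(text)  # S204: second labeled data

    third: Labels = {}   # S206: parts that differ, re-labeled by the preset rule
    fourth: Labels = {}  # S206: parts on which both labelings agree
    for level in sorted(first.keys() | second.keys()):
        a, b = first.get(level), second.get(level)
        if a is not None and a == b:
            fourth[level] = a
        else:
            third[level] = rule(text, level, a, b)

    # S208: third and fourth labeled data together form the final labeling.
    return {**fourth, **third}


# Stub labelers standing in for the two processing modes (assumptions).
def serial_stub(text: str) -> Labels:
    return {"category": "appliance", "field": "air_conditioner", "intention": "power_on"}


def parallel_stub(text: str) -> Labels:
    return {"category": "appliance", "field": "air_conditioner", "intention": "set_temperature"}


# Toy preset rule: prefer the second labeling when the two disagree.
def prefer_second(text: str, level: str, a: Optional[str], b: Optional[str]) -> str:
    return b if b is not None else a


result = annotate("set the air conditioner to 26 degrees", serial_stub, parallel_stub, prefer_second)
print(result)
```

In this toy run only the `intention` level differs between the two labelings, so it alone is passed to the rule, while `category` and `field` are accepted unchanged.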
Optionally, in this embodiment, the text may include, but is not limited to, a document text, a picture text, and the like. Where the text is document text, data labeling of the document text may include, but is not limited to, labeling images in the document, and labeling words, and the format of the document may include, but is not limited to, word format, pdf format, and the like. In the case that the text is a picture text, labeling the picture text may include, but is not limited to, performing data labeling on an object in the picture, for example, performing data labeling on a person, an animal, and the like in the picture, where a format of the picture text is not particularly limited.
It should be noted that, in the case where the text is a document text, the target objects in the document text may include but are not limited to: words, phrases, long sentences, etc. In the case where the text is pictorial text, the target objects may include, but are not limited to: humans, animals, etc.
It should be noted that part of the text may already have been labeled before the text to be labeled is obtained; that is, the text to be labeled may include, but is not limited to, text that has not been labeled at all and text that is already partially labeled.
It can be seen that, in this embodiment, the data annotation of the text may be applied to data annotation of a document text and/or an image text, and the annotated data is input into a neural network, so as to identify the document text or identify the image text.
Optionally, in an embodiment, labeling the text with data in the first processing mode to obtain the first labeled data includes:
S1, determining a first category corresponding to the text, and inputting the text into a first layer of a first neural network according to the first category to obtain labeled data corresponding to the first category;
S2, inputting the labeled data corresponding to the first category into a second layer of the first neural network to obtain the first labeled data.
In practical applications, the first processing mode includes, but is not limited to, verification and recognition using a TextCNN model; the algorithm is efficient and suitable for analyzing large amounts of data. FIG. 3 is a flow chart of the first text processing mode.
As shown in FIG. 3, the text is classified layer by layer. At the start of execution, the input text is divided at the category level into different categories (corresponding to the first category of the text), and each category has its own set of fields. Carrying its category label, the text then enters the field-level division under that category. The field layer uses the same processing as the previous layer: each field corresponds to a set of intentions, and once the field is determined, the text enters the corresponding intention-set layer for further division and verification. After verification and labeling at all levels are complete, the result is a processed text labeled at each level. The labels at different levels are logically related, and there is a clear constraint relationship between upper and lower layers.
The first processing mode is suited to cases with a large number of data labels: layer-by-layer processing splits the labels across branches, which can effectively increase the calculation speed.
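The layer-by-layer cascade can be sketched as follows. This is only an illustration of the control flow: the text mentions TextCNN as one possible model, but here each classifier is replaced by a toy keyword matcher, and all names and labels (`keyword_clf`, `cascade_label`, `appliance_control`, and so on) are hypothetical.

```python
from typing import Callable, Dict

Classifier = Callable[[str], str]


def keyword_clf(rules: Dict[str, str], default: str) -> Classifier:
    """Build a toy classifier in which the first matching keyword wins."""
    def classify(text: str) -> str:
        for keyword, label in rules.items():
            if keyword in text:
                return label
        return default
    return classify


def unknown(text: str) -> str:
    """Fallback used when a branch has no classifier of its own."""
    return "unknown"


def cascade_label(
    text: str,
    category_clf: Classifier,
    field_clfs: Dict[str, Classifier],
    intention_clfs: Dict[str, Classifier],
) -> Dict[str, str]:
    # Each layer only searches the label set selected by the layer above it.
    category = category_clf(text)
    field = field_clfs.get(category, unknown)(text)
    intention = intention_clfs.get(field, unknown)(text)
    return {"category": category, "field": field, "intention": intention}


category_clf = keyword_clf({"air conditioner": "appliance_control"}, "chat")
field_clfs = {
    "appliance_control": keyword_clf({"air conditioner": "air_conditioner"}, "unknown"),
}
intention_clfs = {
    "air_conditioner": keyword_clf({"turn on": "power_on", "degrees": "set_temperature"}, "other"),
}

labels = cascade_label("turn on the air conditioner", category_clf, field_clfs, intention_clfs)
print(labels)
```

Because the field and intention classifiers are chosen by the label of the layer above, a wrong category sends the text down the wrong branch, which is the error accumulation that the Background section describes for serial processing.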
Optionally, in this embodiment, labeling the text with data in the second processing mode to obtain the second labeled data may include:
S1, determining a second category and a third category corresponding to the text according to different classification modes;
S2, inputting the labeling data into a second neural network according to the second category to obtain labeled data corresponding to the second category, and inputting the labeling data into a third neural network according to the third category to obtain labeled data corresponding to the third category;
S3, processing the labeled data corresponding to the second category and the labeled data corresponding to the third category according to preset conditions to obtain the second labeled data.
In practical applications, the second processing mode is a text verification and annotation method in which each layer is processed independently and in parallel, and may be implemented by, but is not limited to, the RoBERTa algorithm. FIG. 4 is a flow chart of the second text processing mode.
As shown in FIG. 4, unlike the first processing mode, the second processing mode splits the language hierarchy into independent label sets, such as a category set, a field set and an intention set. Each set contains all the labels of its level, so each set is far larger than the label sets used in the first mode.
The second processing mode feeds the input text material to every layer for parallel analysis, and multiple layers can execute simultaneously to produce analysis results. This can improve the processing efficiency of the system. After the per-level processing, the number of results is an integer multiple of the number of original input materials, and each result carries the labeling result of one level. Once all the outputs of this preliminary processing are available, the results from different levels for the same text material are combined to form a complete processing result. After all the materials have been combined, the result is similar in form to that of the first processing mode and includes analysis information on aspects such as category, field and intention.
The second mode weakens the constraining logical relationship between levels: each label is selected from the whole large set for labeling and verification, so there is no constraint between upper and lower layers and errors in an upper layer cannot propagate to a lower layer. At the same time, the labeling boundaries between different categories, fields and intentions are broken.
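A minimal sketch of the parallel mode follows, assuming one independent classifier per level and a thread pool for concurrent execution. The lambda classifiers are stand-ins for real models, and the function name `parallel_label` and the label names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Classifier = Callable[[str], str]


def parallel_label(text: str, level_clfs: Dict[str, Classifier]) -> Dict[str, str]:
    """Run every level's classifier on the same text at once, then merge."""
    with ThreadPoolExecutor(max_workers=max(1, len(level_clfs))) as pool:
        # One task per level; no level sees another level's output.
        futures = {level: pool.submit(clf, text) for level, clf in level_clfs.items()}
        # Merge the per-level results for this text into one record.
        return {level: future.result() for level, future in futures.items()}


# Each stub picks from the full label set of its own level (assumed labels).
level_clfs: Dict[str, Classifier] = {
    "category": lambda t: "appliance_control" if "air conditioner" in t else "chat",
    "field": lambda t: "air_conditioner" if "air conditioner" in t else "unknown",
    "intention": lambda t: "set_temperature" if "degrees" in t else "other",
}

merged = parallel_label("set the air conditioner to 26 degrees", level_clfs)
print(merged)
```

Since no classifier receives the output of another, a mistake at the category level cannot influence the field or intention result, but nothing prevents the merged record from combining labels that would be inconsistent under the hierarchy.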
Optionally, in this embodiment, after the two different processing procedures, namely hierarchical layer-by-layer serial processing and hierarchical parallel processing, two different processing results are obtained for the same set of input data. Both sets of results come from machine verification, so each contains some verification errors. To eliminate and correct these errors in a targeted way, the results need to be compared. There are two criteria for the comparison:
1) Compare the verification results for the same language material. If the results agree completely at every checked position, such as the category position, the field position and the intention position, the verification result for that data is judged to be correct. Otherwise the data is called "bad data". All bad data produced during the comparison is stored in a bad data database.
2) For results that agree, a further check is performed to ensure the accuracy of the data: the discrimination probability of each label is extracted, the probability values obtained by the two methods are combined in a weighted average, and data whose weighted value is less than 0.9 is added to the bad data database.
These two comparison criteria split the data into two batches. One batch is treated by default as data that has passed annotation verification and needs no secondary processing. The data in the bad data database is input into a manual verification system for a second, manual verification (which is equivalent to labeling the data according to a preset rule).
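The two comparison criteria can be sketched as below. The 0.9 threshold comes from the text; the equal weights of 0.5, the `Prediction` structure and the function names are assumptions made for illustration, since the weights are not specified.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Prediction:
    labels: Dict[str, str]   # level -> predicted label
    probs: Dict[str, float]  # level -> probability of the predicted label


def is_bad(first: Prediction, second: Prediction,
           w_first: float = 0.5, w_second: float = 0.5,
           threshold: float = 0.9) -> bool:
    # Criterion 1: any disagreement at any level marks the sample as bad data.
    if first.labels != second.labels:
        return True
    # Criterion 2: even when the labels agree, the weighted average of the
    # two probabilities must reach the threshold at every level.
    for level in first.labels:
        weighted = (w_first * first.probs[level] + w_second * second.probs[level]) / (w_first + w_second)
        if weighted < threshold:
            return True
    return False


def split_batch(pairs: List[Tuple[Prediction, Prediction]]):
    """Return (qualified, bad) lists of prediction pairs."""
    qualified, bad = [], []
    for first, second in pairs:
        (bad if is_bad(first, second) else qualified).append((first, second))
    return qualified, bad


agree_confident = (Prediction({"intention": "power_on"}, {"intention": 0.95}),
                   Prediction({"intention": "power_on"}, {"intention": 0.93}))
agree_unsure = (Prediction({"intention": "power_on"}, {"intention": 0.95}),
                Prediction({"intention": "power_on"}, {"intention": 0.80}))
disagree = (Prediction({"intention": "power_on"}, {"intention": 0.99}),
            Prediction({"intention": "power_off"}, {"intention": 0.99}))

qualified, bad = split_batch([agree_confident, agree_unsure, disagree])
print(len(qualified), len(bad))
```

In this toy batch only the first pair passes both criteria; the second agrees on the label but falls below the threshold, and the third disagrees outright, so both would go to manual verification.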
Through the embodiment provided by the application, the text to be labeled is obtained, wherein the text at least comprises one target object to be labeled; labeling data of the text by a first hierarchical serial processing mode to obtain first labeled data, and labeling data of the text by a second hierarchical parallel processing mode to obtain second labeled data; labeling the part of the first labeling data, which is different from the second labeling data, according to a preset rule to obtain third labeling data, and labeling the part of the first labeling data, which is the same as the second labeling data, to obtain fourth labeling data; the third labeling data and the fourth labeling data are determined as the labeling data of the text, so that the purposes of combining two labeling data modes, comparing the data with the difference generated by the two modes and then performing secondary processing are achieved, the technical effect of improving the accuracy of the text labeling data is achieved, and the technical problem that in the prior art, the accuracy of data labeling on the text is low is solved.
As an alternative embodiment, after determining the third annotation data and the fourth annotation data as the annotation data of the text, the method may further include:
inputting the labeling data of the text into a target neural network model, and outputting the probability of executing target operation on a target object;
in the event that the probability is greater than a predetermined threshold, the target operation is performed in response to an instruction to the target object.
Executing the target operation in response to the instruction of the target object comprises: executing, in response to the instruction of the target object, an operation of deleting the labeled data of the target object; or executing, in response to the instruction of the target object, an operation of adding labeled data for the target object; or executing, in response to the instruction of the target object, an operation of updating the labeled data of the target object.
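The threshold check and the three operations can be sketched as follows. The probability is assumed to come from the target neural network model, which is not modeled here; the function name `apply_operation`, the in-memory dictionary store and the threshold value 0.8 are all hypothetical.

```python
from typing import Dict, Optional


def apply_operation(
    annotations: Dict[str, str],  # target object -> its labeled data
    target: str,
    operation: str,               # "delete", "add" or "update"
    probability: float,           # assumed output of the target model
    threshold: float,
    label: Optional[str] = None,
) -> bool:
    """Apply the operation only if the probability exceeds the threshold."""
    if probability <= threshold:
        return False
    if operation == "delete":
        annotations.pop(target, None)
    elif operation in ("add", "update"):
        if label is None:
            raise ValueError("a label is required for add and update")
        if operation == "add":
            annotations.setdefault(target, label)  # do not overwrite existing data
        else:
            annotations[target] = label
    else:
        raise ValueError(f"unsupported operation: {operation}")
    return True


store = {"person A": "user"}
skipped = apply_operation(store, "person A", "update", 0.42, threshold=0.8, label="guest")
applied = apply_operation(store, "person A", "update", 0.93, threshold=0.8, label="owner")
print(skipped, applied, store)
```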
As an optional embodiment, the present application further provides a labeled data verification method based on multiple levels and multiple models.
In the analysis of natural language, semantic analysis is roughly divided into several levels according to the rules and internal logical associations of the language. The levels run from broad to fine, so that a sentence can be completely decomposed into a form convenient for machine understanding and representation. FIG. 5 is a schematic diagram of a text semantic hierarchy showing levels such as category, field and intention: a category contains fields, the same field is divided into different categories, and after several rounds of such division the text is analyzed and broken down into keyword form. This hierarchical division is the basis for the two different text processing modes.
FIG. 6 is a flow chart of the labeled data verification method based on multiple levels and multiple models. The specific steps are as follows:
step 1, inputting texts to be processed into two text processing systems as input data respectively, wherein text samples are subjected to basic processing such as manual labeling before input.
And 2, respectively processing the same batch of input text data to be processed by the two text processing systems, thereby generating two input text processing processes and finally obtaining two processing results. The two processing procedures are respectively hierarchical layer-by-layer serial processing and a parallel processing method without distinguishing layers, and are similar to the first processing mode and the second processing mode.
And 3, comparing the processing results generated in the two text processing processes, wherein the comparison standard comprises information such as the analysis result and the position of each layer, and distinguishing the compared data into two parts, namely a completely consistent comparison result and a different comparison result.
And 4, respectively processing the data generated after comparison:
1) and for the data with the same comparison result, the default is that the text analysis processing is completely correct and the data is directly used as the output result of the text processing.
2) And extracting data with different comparison results, entering a manual processing and checking link, and analyzing and labeling the text data by special practitioners. The manually marked data is used as the output result of another part.
And 5, integrating the two generated text processing results, and outputting the integrated text processing results as final results of the scheme. The result is analyzed and processed by a machine twice, is compared and checked once, and is partially checked manually, so that the method has higher accuracy.
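Steps 3 to 5 can be sketched as follows, under stated assumptions: each system's output is a dictionary from a text identifier to a per-level label dictionary, and manual review is modeled as a callable. The names `split_by_agreement`, `verify`, and `manual_review` are hypothetical and do not come from the application.

```python
from typing import Callable, Dict, List, Tuple

Annotation = Dict[str, str]          # level name -> label, e.g. {"category": "..."}
Results = Dict[str, Annotation]      # text id -> annotation


def split_by_agreement(serial: Results, parallel: Results) -> Tuple[Results, List[str]]:
    """Step 3: separate texts where both systems agree from texts where they differ."""
    agreed: Results = {}
    disputed: List[str] = []
    for text_id, annotation in serial.items():
        if parallel.get(text_id) == annotation:
            agreed[text_id] = annotation
        else:
            disputed.append(text_id)
    return agreed, disputed


def verify(serial: Results, parallel: Results,
           manual_review: Callable[[str], Annotation]) -> Results:
    """Steps 4 and 5: accept agreed results, send the rest to review, then merge."""
    agreed, disputed = split_by_agreement(serial, parallel)
    reviewed = {text_id: manual_review(text_id) for text_id in disputed}
    return {**agreed, **reviewed}
```

Only the disputed texts reach `manual_review`, which mirrors how the scheme directs human effort to the cases the two machine procedures cannot settle.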
Through the embodiments provided by the application, the following benefits can be obtained:
1. The advantages of manual processing and machine processing are combined. Manual processing is highly accurate, and results obtained by analyzing text from a human viewpoint are more consistent with people's common understanding of natural language, but it consumes considerable time and human resources. Machine processing uses traditional text processing algorithms to verify and analyze the labels, and is highly efficient. Combining the two modes directs human resources to the parts that machines find difficult to analyze correctly, which both ensures the accuracy of the analysis result and improves processing efficiency.
2. Machine processing applies two different procedures. The hierarchical layer-by-layer procedure preserves the internal logical associations of the text analysis, while the procedure that does not distinguish layers obtains the semantic result by labeling every layer simultaneously. The outputs of the two procedures are compared, and the parts that are exactly the same can be output as correct results. This comparison makes the machine processing result more reliable, and at the same time screens out the parts that the machine finds difficult to identify so that they enter manual processing, which improves efficiency.
3. A more reliable comparison and discrimination mechanism. When the analysis results produced by the two machine methods are compared, two criteria are used to judge data quality. The first is the traditional position-by-position comparison: each judged element is compared one by one, and the data is treated as bad data if any judgment results differ. The second applies to data for which every judgment position gives the same result: a comprehensive judgment is made from the probability result at each identified position, with the probability results of the two procedures weighted and averaged to obtain a judgment result. This is needed because machine recognition algorithms have unavoidable errors, and even when two algorithms are combined, some wrongly recognized data remains. The weighted average is therefore compared with a tolerable discrimination probability of 0.9, and data below 0.9 is added to the bad data set. Together, the two discrimination mechanisms make the machine's judgment more reliable.
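The two discrimination criteria can be sketched as below. The 0.9 tolerance comes from the text; the equal weighting of the two procedures, the per-position application of the threshold, and the names `is_bad_sample` and `weight_serial` are assumptions made only for illustration.

```python
from typing import List, Tuple

Position = Tuple[str, float]  # (predicted label, probability) at one judgment position


def is_bad_sample(
    serial: List[Position],
    parallel: List[Position],
    weight_serial: float = 0.5,   # assumed weighting; not given in the text
    tolerance: float = 0.9,       # tolerable discrimination probability from the text
) -> bool:
    """Return True if the sample should be added to the bad data set."""
    if len(serial) != len(parallel):
        raise ValueError("both procedures must report the same positions")
    for (label_a, prob_a), (label_b, prob_b) in zip(serial, parallel):
        # Criterion 1: position-by-position comparison of the labels.
        if label_a != label_b:
            return True
        # Criterion 2: labels agree, so check the weighted average probability.
        combined = weight_serial * prob_a + (1.0 - weight_serial) * prob_b
        if combined < tolerance:
            return True
    return False
```

A sample passes only when every position has matching labels and a combined probability of at least the tolerance; anything else would go to manual review.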
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiment of the present invention, there is also provided a text data annotation device for implementing the text data annotation method. As shown in fig. 7, the apparatus includes: an acquisition unit 71, a first labeling unit 73, a second labeling unit 75, and a determination unit 77.
The obtaining unit 71 is configured to obtain a text to be labeled, where the text at least includes one target object to be labeled.
The first labeling unit 73 is configured to label data to the text in a hierarchical, layer-by-layer, serial first processing manner to obtain first labeled data, and label data to the text in a parallel second processing manner without distinguishing hierarchies to obtain second labeled data.
The second labeling unit 75 is configured to label, according to a preset rule, a part where the first labeling data and the second labeling data have a difference, so as to obtain third labeling data, and label a part where the first labeling data and the second labeling data are the same, so as to obtain fourth labeling data.
A determining unit 77 for determining the third annotation data and the fourth annotation data as the annotation data of the text.
Optionally, in this embodiment, the first labeling unit 73 may include:
a first obtaining module, configured to determine a first category corresponding to the text and input the text into a first layer of a first neural network according to the first category, to obtain labeling data corresponding to the first category;
and a second obtaining module, configured to input the labeling data corresponding to the first category into a second layer of the first neural network to obtain the first labeling data.
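The layer-by-layer serial flow of these two modules can be sketched as a chain in which each stage consumes the previous stage's output. The layers here are toy stand-in functions, not real neural network layers, and every name (`annotate_serial`, `toy_classify`, `toy_first_layer`, `toy_second_layer`) is hypothetical.

```python
from typing import Callable, Dict

Labels = Dict[str, str]


def annotate_serial(
    text: str,
    classify: Callable[[str], str],
    first_layer: Callable[[str, str], Labels],
    second_layer: Callable[[Labels], Labels],
) -> Labels:
    """Run the stages in order; each stage depends on the one before it."""
    category = classify(text)                      # determine the first category
    category_labels = first_layer(text, category)  # labels for the first category
    return second_layer(category_labels)           # refine into the first labeling data


# Toy stand-ins so the sketch runs end to end (assumptions, not the patent's models).
def toy_classify(text: str) -> str:
    return "device_control" if "turn" in text else "chat"


def toy_first_layer(text: str, category: str) -> Labels:
    field = "air_conditioner" if "air conditioner" in text else "unknown"
    return {"category": category, "field": field}


def toy_second_layer(labels: Labels) -> Labels:
    intent = "power_on" if labels["category"] == "device_control" else "none"
    return {**labels, "intent": intent}
```

Because the second layer only sees what the first layer produced, an error at an early level propagates downward, which is one reason the scheme cross-checks this mode against a second one.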
Optionally, in this embodiment, the first labeling unit 73 may include:
a determining module, configured to determine a second category and a third category corresponding to the text according to different classification modes;
a third obtaining module, configured to obtain, according to the second category, labeling data corresponding to the second category from a second neural network, and to obtain, according to the third category, labeling data corresponding to the third category from a third neural network;
and a fourth obtaining module, configured to process the labeling data corresponding to the second category and the labeling data corresponding to the third category according to preset conditions to obtain the second labeling data.
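The parallel mode can be sketched with two independent branches that run concurrently and a merge step standing in for the "preset conditions". The branches are plain callables rather than real neural networks, and the merge rule shown (the second branch wins on conflicting keys) is an assumption, as are the names `annotate_parallel` and `merge_prefer_second`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Labels = Dict[str, str]
Branch = Callable[[str], Labels]


def merge_prefer_second(second: Labels, third: Labels) -> Labels:
    """Assumed preset condition: keep all labels, second branch wins on conflicts."""
    return {**third, **second}


def annotate_parallel(
    text: str,
    second_branch: Branch,
    third_branch: Branch,
    merge: Callable[[Labels, Labels], Labels] = merge_prefer_second,
) -> Labels:
    """Run both branches on the same text at the same time, then merge their labels."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        second_future = pool.submit(second_branch, text)
        third_future = pool.submit(third_branch, text)
        return merge(second_future.result(), third_future.result())
```

Neither branch waits on the other's output, which is the key difference from the layer-by-layer serial mode.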
Through the embodiment provided by the application, the obtaining unit 71 obtains a text to be labeled, wherein the text comprises at least one target object to be labeled; the first labeling unit 73 labels the text with data in a first processing mode of hierarchical layer-by-layer serial processing to obtain first labeled data, and labels the text with data in a second processing mode of parallel processing that does not distinguish hierarchies to obtain second labeled data; the second labeling unit 75 labels the part where the first labeling data differs from the second labeling data according to a preset rule to obtain third labeling data, and labels the part where the first labeling data is the same as the second labeling data to obtain fourth labeling data; the determination unit 77 determines the third annotation data and the fourth annotation data as the annotation data of the text. This combines the two annotation modes, compares their outputs, and applies secondary processing to the data on which they differ, thereby improving the accuracy of the text annotation data and solving the technical problem in the prior art that the accuracy of data labeling on text is low.
As an alternative embodiment, the apparatus may further include:
the obtaining unit is used for inputting the labeling data of the text into the target neural network model after the third labeling data and the fourth labeling data are determined as the labeling data of the text, and outputting the probability of executing target operation on the target object;
and a response unit for performing the target operation in response to the instruction to the target object in the case where the probability is greater than a predetermined threshold.
The response unit includes:
a first response module, configured to perform, in response to the instruction for the target object, an operation of deleting the annotation data of the target object; or
a second response module, configured to perform, in response to the instruction for the target object, an operation of adding annotation data of the target object; or
a third response module, configured to perform, in response to the instruction for the target object, an operation of updating the annotation data of the target object.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the method for annotating data of a text, as shown in fig. 8, the electronic device includes a memory 802 and a processor 804, the memory 802 stores a computer program, and the processor 804 is configured to execute the steps in any one of the above method embodiments through the computer program.
Optionally, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a text to be labeled, wherein the text at least comprises a target object to be labeled;
s2, labeling the text with data through a first processing mode of hierarchical layer-by-layer serial processing to obtain first labeled data, and labeling the text with data through a second processing mode of parallel processing that does not distinguish hierarchies to obtain second labeled data;
s3, labeling the part of the first labeling data which is different from the second labeling data according to a preset rule to obtain third labeling data, and labeling the part of the first labeling data which is the same as the second labeling data to obtain fourth labeling data;
s4, the third annotation data and the fourth annotation data are determined as the annotation data of the text.
Optionally, those skilled in the art will understand that the structure shown in fig. 8 is only illustrative. The electronic device may be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 8 does not limit the structure of the electronic device: for example, the electronic device may include more or fewer components (e.g., network interfaces) than shown in fig. 8, or have a different configuration from that shown in fig. 8.
The memory 802 may be used to store software programs and modules, such as the program instructions/modules corresponding to the text data annotation method and device in the embodiments of the present invention. The processor 804 executes various functional applications and data processing by running the software programs and modules stored in the memory 802, thereby implementing the text data annotation method. The memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 802 may further include memory located remotely from the processor 804, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 802 may specifically be used, without limitation, to store the document to be annotated, the annotation data corresponding to the document, and other information. As an example, as shown in fig. 8, the memory 802 may include, but is not limited to, the obtaining unit 71, the first labeling unit 73, the second labeling unit 75, and the determining unit 77 of the text data annotation device. It may further include, but is not limited to, other module units of the text data annotation device, which are not described in detail in this example.
Optionally, the transmission device 806 is configured to receive or transmit data via a network. Examples of the network include wired networks and wireless networks. In one example, the transmission device 806 includes a network adapter (Network Interface Controller, NIC) that can be connected to a router and other network devices via a network cable so as to communicate with the internet or a local area network. In one example, the transmission device 806 is a Radio Frequency (RF) module, which communicates with the internet wirelessly.
In addition, the electronic device further includes: a display 808, configured to display the to-be-processed document information; and a connection bus 810 for connecting the respective module parts in the above-described electronic apparatus.
According to a further aspect of an embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a text to be labeled, wherein the text at least comprises a target object to be labeled;
s2, labeling the text with data through a first processing mode of hierarchical layer-by-layer serial processing to obtain first labeled data, and labeling the text with data through a second processing mode of parallel processing that does not distinguish hierarchies to obtain second labeled data;
s3, labeling the part of the first labeling data which is different from the second labeling data according to a preset rule to obtain third labeling data, and labeling the part of the first labeling data which is the same as the second labeling data to obtain fourth labeling data;
s4, the third annotation data and the fourth annotation data are determined as the annotation data of the text.
Optionally, in this embodiment, a person skilled in the art will understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only Memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method for labeling data of a text is characterized by comprising the following steps:
acquiring a text to be labeled, wherein the text at least comprises a target object to be labeled;
labeling the text with data in a first processing mode of hierarchical layer-by-layer serial processing to obtain first labeled data, and labeling the text with data in a second processing mode of parallel processing that does not distinguish hierarchies to obtain second labeled data;
labeling the part of the first labeling data, which is different from the second labeling data, according to a preset rule to obtain third labeling data, and labeling the part of the first labeling data, which is the same as the second labeling data, to obtain fourth labeling data;
and determining the third annotation data and the fourth annotation data as the annotation data of the text.
2. The method of claim 1, wherein labeling the text data by the first processing method to obtain first labeling data comprises:
determining a first category corresponding to the text, and inputting the text into a first layer of a first neural network according to the first category to obtain marking data corresponding to the first category;
and inputting the labeling data corresponding to the first category into a second layer of the first neural network to obtain the first labeling data.
3. The method of claim 1, wherein labeling the text data by the second processing method to obtain second labeling data comprises:
determining a second category and a third category corresponding to the text according to different classification modes;
inputting the second class into a second neural network to obtain labeling data corresponding to the second class, and inputting the third class into a third neural network to obtain labeling data corresponding to the third class;
and processing the labeling data corresponding to the second category and the labeling data corresponding to the third category according to preset conditions to obtain the second labeling data.
4. The method of claim 1, wherein after determining the third and fourth annotation data as the annotation data of the text, the method further comprises:
inputting the labeled data of the text into a target neural network model, and outputting the probability of executing target operation on a target object;
in the event that the probability is greater than a predetermined threshold, performing the target operation in response to an instruction to the target object.
5. The method of claim 4, wherein performing the target operation in response to the instruction to the target object comprises:
performing, in response to the instruction for the target object, an operation of deleting the annotation data of the target object; or
performing, in response to the instruction for the target object, an operation of adding annotation data of the target object; or
performing, in response to the instruction for the target object, an operation of updating the annotation data of the target object.
6. A data annotation device for text, comprising:
an acquisition unit, configured to acquire a text to be labeled, wherein the text comprises at least one target object to be labeled;
a first labeling unit, configured to label the text with data in a first processing mode of hierarchical layer-by-layer serial processing to obtain first labeled data, and to label the text with data in a second processing mode of parallel processing that does not distinguish hierarchies to obtain second labeled data;
the second labeling unit is used for labeling the part of the first labeling data, which is different from the second labeling data, according to a preset rule to obtain third labeling data, and labeling the part of the first labeling data, which is the same as the second labeling data, to obtain fourth labeling data;
and the determining unit is used for determining the third annotation data and the fourth annotation data as the annotation data of the text.
7. The apparatus of claim 6, wherein the first labeling unit comprises:
the first obtaining module is used for determining a first category corresponding to the text, inputting the text into a first layer of a first neural network according to the first category, and obtaining marking data corresponding to the first category;
and the second obtaining module is used for inputting the marking data corresponding to the first class into a second layer of the first neural network to obtain the first marking data.
8. The apparatus of claim 6, wherein the first labeling unit further comprises:
the determining module is used for determining a second category and a third category corresponding to the text according to different classification modes;
a third obtaining module, configured to obtain labeling data corresponding to the second category by inputting the labeling data to a second neural network according to the second category, and obtain labeling data corresponding to the third category by inputting the labeling data to a third neural network according to the third category;
and the fourth obtaining module is used for processing the labeling data corresponding to the second category and the labeling data corresponding to the third category according to preset conditions to obtain the second labeling data.
9. A computer-readable storage medium, characterized in that the storage medium comprises a stored program, wherein the program is operative to perform the method of any of the preceding claims 1 to 5.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 5 by means of the computer program.
CN202010712345.5A 2020-07-22 2020-07-22 Text data labeling method and device, storage medium and electronic device Active CN111859862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010712345.5A CN111859862B (en) 2020-07-22 2020-07-22 Text data labeling method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010712345.5A CN111859862B (en) 2020-07-22 2020-07-22 Text data labeling method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN111859862A true CN111859862A (en) 2020-10-30
CN111859862B CN111859862B (en) 2024-03-22

Family

ID=72949280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010712345.5A Active CN111859862B (en) 2020-07-22 2020-07-22 Text data labeling method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN111859862B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600577A (en) * 2022-10-21 2023-01-13 文灵科技(北京)有限公司(Cn) Event segmentation method and system for news manuscript labeling
CN115638833A (en) * 2022-12-23 2023-01-24 保定网城软件股份有限公司 Monitoring data processing method and system

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010250814A (en) * 2009-04-14 2010-11-04 Nec (China) Co Ltd Part-of-speech tagging system, training device and method of part-of-speech tagging model
CN103530282A (en) * 2013-10-23 2014-01-22 北京紫冬锐意语音科技有限公司 Corpus tagging method and equipment
US20160321358A1 (en) * 2015-04-30 2016-11-03 Oracle International Corporation Character-based attribute value extraction system
CN106707293A (en) * 2016-12-01 2017-05-24 百度在线网络技术(北京)有限公司 Obstacle recognition method and device for vehicles
WO2019095899A1 (en) * 2017-11-17 2019-05-23 中兴通讯股份有限公司 Material annotation method and apparatus, terminal, and computer readable storage medium
CN110427487A (en) * 2019-07-30 2019-11-08 中国工商银行股份有限公司 A kind of data mask method, device and storage medium
CN110598206A (en) * 2019-08-13 2019-12-20 平安国际智慧城市科技股份有限公司 Text semantic recognition method and device, computer equipment and storage medium
CN110750694A (en) * 2019-09-29 2020-02-04 支付宝(杭州)信息技术有限公司 Data annotation implementation method and device, electronic equipment and storage medium
CN110909768A (en) * 2019-11-04 2020-03-24 北京地平线机器人技术研发有限公司 Method and device for acquiring marked data
CN111159404A (en) * 2019-12-27 2020-05-15 海尔优家智能科技(北京)有限公司 Text classification method and device
CN111159494A (en) * 2019-12-30 2020-05-15 北京航天云路有限公司 Multi-user concurrent processing data labeling method
CN111352348A (en) * 2018-12-24 2020-06-30 北京三星通信技术研究有限公司 Device control method, device, electronic device, and computer-readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHUZI NIU 等: "Top-k learning torank Labeling ranking and evalua", 《RESEARCH GATE》, pages 1 - 10 *
孙劲光,马志芳: "基于情感词属性和云模型的文本情感分类方法", 计算机工程, pages 211 - 215 *
马安香;高克宁;张晓红;张斌;: "基于CPN网络的Deep Web数据语义标注", 东北大学学报(自然科学版), no. 06, pages 36 - 39 *

Also Published As

Publication number Publication date
CN111859862B (en) 2024-03-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant