CN112487798A

CN112487798A - Knowledge-graph-based method for efficient and accurate processing of noise words in text

Info

Publication number: CN112487798A
Application number: CN202011422655.XA
Authority: CN
Inventors: 李抒雁; 沙涛
Original assignee: Shanghai Shixiang Culture Communication Co ltd
Current assignee: Shanghai Shixiang Culture Communication Co ltd
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2021-03-12

Abstract

The invention discloses a knowledge-graph-based method for efficiently and accurately processing noise words in text, comprising the following steps: S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words; S2, assigning a weight to each word; S3, segmenting the text with a word segmentation tool; S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text; S5, matching the corrected text against the filter words, while keeping the recorded business words unaffected by the filtering; S6, outputting the filtered text. The method filters text data accurately and stably, can be extended and modified flexibly in real time, overcomes interference from polyphones and synonyms in the text, preserves sentence meaning while filtering noise words, and provides efficient and accurate noise word filtering in scenarios where spoken language is converted into text.

Description

Knowledge-graph-based method for efficient and accurate processing of noise words in text

Technical Field

The invention relates to the technical field of knowledge graphs, and in particular to a knowledge-graph-based method for efficiently and accurately processing noise words in text.

Background

A knowledge graph, known in the library and information science community as knowledge domain visualization or knowledge domain mapping, is a family of graphs that display the development process and structural relationships of knowledge. It uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interconnections among knowledge items. It combines the theories and methods of disciplines such as applied mathematics, graphics, information visualization, and information science with methods such as bibliometric citation analysis and co-occurrence analysis, and uses visual graphs to present the core structure, development history, frontier fields, and overall knowledge framework of a discipline, thereby achieving multidisciplinary integration. A knowledge graph can provide practical and valuable reference for disciplinary research.

However, existing noise word processing methods rely on sensitive-word filtering: a complete sensitive-word lexicon is matched against the words of the text, and any sensitive word found is removed from the text. Such methods attend only to sensitive words. In a spoken-language environment, many modal particles are also noise words, yet these methods provide no noise word filtering capability for spoken language converted into text. In addition, homophones are easily deleted by mistake or missed during filtering, which affects sentence meaning.

Disclosure of Invention

The invention provides a knowledge-graph-based method for efficiently and accurately processing noise words in text, which can effectively solve the problems described in the background.

To achieve this purpose, the invention provides the following technical scheme: a knowledge-graph-based method for efficiently and accurately processing noise words in text, comprising the following steps:

S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words;

S2, assigning a weight to each word;

S3, segmenting the text with a word segmentation tool;

S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text;

S5, matching the corrected text against the filter words, while keeping the recorded business words unaffected by the filtering;

S6, outputting the filtered text.

According to the technical scheme, the filter-word lexicon in S1 is connected to network data and is classified into categories, including political, pornographic, violent, near-synonymous, and obscure words, and a cross-semantic interweaving network is established for the lexicon.
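As an illustration of the classified lexicon described above, the following minimal Python sketch groups filter words by category and allows one word to carry several categories, which is only one possible reading of the cross-semantic network. The class name `FilterLexicon`, its methods, and the sample category `filler` are hypothetical and are not taken from the invention.

```python
from collections import defaultdict


class FilterLexicon:
    """Filter-word lexicon grouped by category (illustrative sketch only)."""

    def __init__(self):
        self._by_category = defaultdict(set)  # category -> words
        self._category_of = {}                # word -> categories

    def add(self, word, category):
        """Register a word under a category; a word may have several."""
        self._by_category[category].add(word)
        self._category_of.setdefault(word, set()).add(category)

    def categories(self, word):
        """Return every category a word belongs to (empty set if unknown)."""
        return self._category_of.get(word, set())

    def words(self, *categories):
        """Return the union of the requested categories (all if none given)."""
        selected = categories or tuple(self._by_category)
        result = set()
        for name in selected:
            result |= self._by_category.get(name, set())
        return result


# Hypothetical sample data.
lexicon = FilterLexicon()
lexicon.add("um", "filler")
lexicon.add("uh", "filler")
lexicon.add("damn", "violent")
lexicon.add("damn", "obscure")
```

Keeping the categories separate lets a deployment enable only the categories it needs, for example only spoken filler words.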

According to the technical scheme, the knowledge graph in S1 records the relationships among various labels, concepts, and entities;

business-related words are added to the knowledge graph and cross-referenced with the filter-word lexicon; overlapping words are marked, and the contextual relationships of the words are analyzed to judge whether a new word meaning arises, thereby avoiding word errors.
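The cross-referencing between the knowledge graph and the filter-word lexicon can be sketched as follows. This is a minimal sketch under stated assumptions: the graph is reduced to a homophone table plus a set of triples, and the names `BusinessGraph`, `add_word`, `add_relation`, `surface_forms`, and `overlapping_words`, as well as the sample words, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessGraph:
    """Toy business knowledge graph (assumed structure, not the patent's)."""
    # canonical business word -> set of homophone variants
    homophones: dict = field(default_factory=dict)
    # (subject, relation, object) triples linking labels, concepts, entities
    triples: set = field(default_factory=set)

    def add_word(self, word, variants=()):
        self.homophones.setdefault(word, set()).update(variants)

    def add_relation(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def surface_forms(self):
        """All spellings known to the graph: canonical words and variants."""
        forms = set(self.homophones)
        for variants in self.homophones.values():
            forms |= variants
        return forms


def overlapping_words(graph, filter_words):
    """Words present in both the graph and the filter lexicon."""
    return graph.surface_forms() & set(filter_words)


# Hypothetical sample data.
graph = BusinessGraph()
graph.add_word("Harbin", {"Haerbin"})
graph.add_word("overcoat")
graph.add_relation("overcoat", "is_a", "clothing")
marked = overlapping_words(graph, {"um", "overcoat"})
```

Words returned by `overlapping_words` are the ones that need a decision, since they would otherwise be both protected and filtered.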

According to the technical scheme, the weights in S2 are determined using one or more of a scoring method, a statistical method, a sequence synthesis method, a formula method, a mathematical statistics method, an analytic hierarchy process, and a complexity analysis method.
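Of the weighting approaches listed, the statistical method is the simplest to illustrate. The sketch below assumes that a word's weight is its relative frequency in a reference corpus multiplied by a scale factor; the function name `frequency_weights`, the `scale` parameter, and the sample corpus are hypothetical, and the invention does not specify this formula.

```python
from collections import Counter


def frequency_weights(corpus_tokens, scale=1000):
    """Weight each word by its relative frequency, scaled to an integer."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    # An empty corpus yields an empty dict, so no division by zero occurs.
    return {word: round(scale * n / total) for word, n in counts.items()}


# Hypothetical corpus of four tokens.
weights = frequency_weights(
    ["collect", "collect", "concentrate", "like"], scale=400
)
```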

According to the technical scheme, the word segmentation tool in S3 is a tool that segments the text into word segments according to grammatical and statistical rules or a custom dictionary, so that the text is divided into word combinations of 1-5 characters.
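Dictionary-based segmentation with a length cap can be sketched with forward maximum matching, a common baseline technique that is swapped in here for illustration; the invention does not name a specific algorithm. The sketch caps each segment at five units, counted as characters, which is an assumption about how the 1-5 limit is measured. The function name `forward_max_match` and the sample dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy left-to-right segmentation using a custom dictionary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Unspaced input stands in for text without word delimiters.
segments = forward_max_match("ilikeacoat", {"like", "coat"})
```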

According to the technical scheme, in S4 the words of the text are compared against the knowledge graph and against the filter words; filter words are removed, and certain segmented words are set apart and annotated.
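The homophone correction and recording performed in S4 can be sketched as a lookup from each known variant to its canonical business word. This is a minimal sketch, assuming the knowledge graph can be flattened into a table of variants; the function name `correct_homophones` and the sample data are hypothetical.

```python
def correct_homophones(tokens, homophones):
    """Replace homophone variants with business words and record the hits.

    homophones maps each canonical business word to its set of variants.
    Returns (corrected_tokens, recorded_business_words).
    """
    variant_to_word = {}
    for word, variants in homophones.items():
        variant_to_word[word] = word  # the canonical form maps to itself
        for variant in variants:
            variant_to_word[variant] = word

    corrected, recorded = [], []
    for token in tokens:
        canonical = variant_to_word.get(token)
        if canonical is None:
            corrected.append(token)  # not a business word: leave unchanged
        else:
            corrected.append(canonical)
            if canonical not in recorded:
                recorded.append(canonical)
    return corrected, recorded


# Hypothetical sample data.
corrected, recorded = correct_homophones(
    ["um", "I", "like", "Haerbin", "overcoat"],
    {"Harbin": {"Haerbin"}, "overcoat": set()},
)
```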

According to the technical scheme, the business words recorded in S5 remain stored; they are then imported into the filter-word lexicon and matched against it for screening, and the filter words are matched against the text, thereby reducing the error rate.
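The protection of recorded business words during filtering can be sketched as follows: a token is dropped only if it is a filter word and has not been recorded as a business word. The function name `filter_noise` and the sample data are hypothetical.

```python
def filter_noise(tokens, filter_words, protected):
    """Drop filter words unless they were recorded as business words."""
    protected = set(protected)
    return [
        token for token in tokens
        if token in protected or token not in filter_words
    ]


# "overcoat" is deliberately in both sets to show that protection wins.
kept = filter_noise(
    ["um", "I", "like", "Harbin", "overcoat"],
    filter_words={"um", "overcoat"},
    protected=["Harbin", "overcoat"],
)
```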

According to the technical scheme, after the filtered text is output in S6, the text is manually reviewed and annotated, words are labeled and color-marked, and the text is finalized after approval by a reviewer.
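To support the manual review step, each token can be given a status label and a color before being shown to the reviewer. The sketch below is only one way this might look: the three status names, the ANSI terminal colors, and the names `ReviewItem`, `build_review`, and `render` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewItem:
    token: str
    status: str  # "kept", "removed", or "protected"


def build_review(tokens, filter_words, protected):
    """Label every token so a reviewer can see what the filter decided."""
    protected = set(protected)
    items = []
    for token in tokens:
        if token in protected:
            items.append(ReviewItem(token, "protected"))
        elif token in filter_words:
            items.append(ReviewItem(token, "removed"))
        else:
            items.append(ReviewItem(token, "kept"))
    return items


# ANSI escape codes: red for removed words, green for protected words.
ANSI = {"kept": "", "removed": "\033[31m", "protected": "\033[32m"}
RESET = "\033[0m"


def render(items):
    """Render the labeled tokens as one colored line for a terminal."""
    parts = []
    for item in items:
        color = ANSI[item.status]
        parts.append(f"{color}{item.token}{RESET}" if color else item.token)
    return " ".join(parts)


items = build_review(["um", "I", "Harbin"], {"um"}, {"Harbin"})
```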

Compared with the prior art, the invention has the following beneficial effects: the method filters text data accurately and stably, can be extended and modified flexibly in real time, overcomes interference from polyphones and synonyms in the text, preserves sentence meaning while filtering noise words, and provides efficient and accurate noise word filtering in scenarios where spoken language is converted into text.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.

In the drawings:

FIG. 1 is a schematic structural view of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Example 1:

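As shown in FIG. 1, the invention provides a technical scheme: a knowledge-graph-based method for efficiently and accurately processing noise words in text, comprising the following steps:

S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words;

S2, assigning a weight to each word;

S3, segmenting the text with a word segmentation tool;

S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text;

S5, matching the corrected text against the filter words, while keeping the recorded business words unaffected by the filtering;

S6, outputting the filtered text.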

According to the technical scheme, the filter-word lexicon in S1 is connected to network data and is classified into categories, including political, pornographic, violent, near-synonymous, and obscure words, and a cross-semantic interweaving network is established for the lexicon.

According to the technical scheme, the knowledge graph in S1 records the relationships among various labels, concepts, and entities;

business-related words are added to the knowledge graph and cross-referenced with the filter-word lexicon; overlapping words are marked, and the contextual relationships of the words are analyzed to judge whether a new word meaning arises, thereby avoiding word errors.

According to the technical scheme, in S4 the words of the text are compared against the knowledge graph and against the filter words; filter words are removed, and certain segmented words are set apart and annotated.

According to the technical scheme, the business words recorded in S5 remain stored; they are then imported into the filter-word lexicon and matched against it for screening, and the filter words are matched against the text, thereby reducing the error rate.

According to the technical scheme, after the filtered text is output in S6, the text is manually reviewed and annotated, words are labeled and color-marked, and the text is finalized after approval by a reviewer.

Example 2:

The invention provides a technical scheme: a knowledge-graph-based method for efficiently and accurately processing noise words in text, comprising the following steps:

S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words, wherein the filter-word lexicon is preferably classified to improve reusability;

S2, assigning a weight to each word;

S3, segmenting the text with a word segmentation tool, and adjusting the segmentation result:

Word weights: ["collect": 200, "concentrate": 100]

Text: "like to collect the products of independent innovation of China"

Segmentation result: "like-collect-china-independent-innovative-product";

S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text:

Text: "kaeha, I like Haerbin autumn-Dai warmth"

Result: "kaiha, I like the warmth of the Karl-Kui overcoat" [Karl-Kui] [overcoat];

S5, matching the corrected text against the filter words, wherein the recorded business words are unaffected by the filtering:

Input: "Kanha I like the warmth of the autumn overcoat of Harbin" [Harbin] [overcoat]

Noise filtering: "I like the warmth of the coat in Haerbin autumn";

S6, outputting the filtered text.
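The way word weights can steer segmentation in step S3 of this example can be sketched with a small dynamic program that picks the segmentation with the highest total weight. This is a simplified stand-in, not the tool used by the invention: the scoring rule (sum of weights, with unknown single characters scoring zero), the function name `weighted_segment`, and the sample string and weights are all hypothetical.

```python
def weighted_segment(text, weights, max_len=5):
    """Choose the segmentation that maximizes the sum of word weights."""
    n = len(text)
    # best[i] holds (score, tokens) for the best segmentation of text[:i].
    best = [(0, [])] + [None] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in weights:
                gain = weights[piece]
            elif end - start == 1:
                gain = 0  # unknown single character: allowed, earns nothing
            else:
                continue  # unknown multi-character piece: not a candidate
            score = best[start][0] + gain
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n][1]


# The same ambiguous string splits differently under different weights.
favor_where = weighted_segment(
    "nowhere", {"now": 100, "here": 100, "no": 50, "where": 200}
)
favor_now = weighted_segment(
    "nowhere", {"now": 200, "here": 200, "no": 50, "where": 100}
)
```

Raising the weight of a word, as the example does for "collect" relative to "concentrate", makes the segmenter prefer splits that contain it.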

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

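1. A knowledge-graph-based method for efficiently and accurately processing noise words in text, characterized by comprising the following steps:

S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words;

S2, assigning a weight to each word;

S3, segmenting the text with a word segmentation tool;

S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text;

S5, matching the corrected text against the filter words, while keeping the recorded business words unaffected by the filtering;

S6, outputting the filtered text.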

2. The knowledge-graph-based method for efficiently and accurately processing noise words in text according to claim 1, wherein the filter-word lexicon in S1 is connected to network data and is classified into categories, including political, pornographic, violent, near-synonymous, and obscure words, and a cross-semantic interweaving network is established for the lexicon.

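3. The knowledge-graph-based method for efficiently and accurately processing noise words in text according to claim 1, wherein the knowledge graph in S1 records the relationships among various labels, concepts, and entities;

business-related words are added to the knowledge graph and cross-referenced with the filter-word lexicon; overlapping words are marked, and the contextual relationships of the words are analyzed to judge whether a new word meaning arises, thereby avoiding word errors.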

4. The knowledge-graph-based method for efficiently and accurately processing noise words in text according to claim 1, wherein the weights in S2 are obtained by one or more of a scoring method, a statistical method, a sequence synthesis method, a formula method, a mathematical statistics method, an analytic hierarchy process, and a complexity analysis method.

5. The knowledge-graph-based method for efficiently and accurately processing noise words in text according to claim 1, wherein the word segmentation tool in S3 is a tool that segments the text into word segments according to grammar rules or a custom dictionary, so that the text is divided into word combinations of 1-5 characters.

6. The knowledge-graph-based method for efficiently and accurately processing noise words in text according to claim 1, wherein in S4 the words of the text are compared against the knowledge graph and against the filter words, filter words are removed, and certain segmented words are set apart and annotated.

7. The knowledge-graph-based method for efficiently and accurately processing noise words in text according to claim 1, wherein the business words recorded in S5 remain stored, are then imported into the filter-word lexicon and matched against it for screening, and the filter words are matched against the text, thereby reducing the error rate.

8. The knowledge-graph-based method for efficiently and accurately processing noise words in text according to claim 1, wherein after the filtered text is output in S6, the text is manually reviewed and annotated, words are labeled and color-marked, and the text is finalized after approval by a reviewer.