CN112487798A - Text efficient and accurate noise word processing method based on knowledge graph - Google Patents
- Publication number
- CN112487798A (application CN202011422655.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- words
- word
- knowledge
- filtering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses an efficient and accurate method for processing noise words in text based on a knowledge graph, comprising the following steps: S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words; S2, assigning a weight to each word; S3, segmenting the text with a word segmentation tool; S4, first correcting homophones in the text into business words through the business knowledge graph, and recording all business words that appear in the text; S5, matching the corrected text against the filter words, while keeping the recorded business words unaffected by filtering; S6, outputting the filtered text. The method filters text data accurately and stably, can be extended and modified flexibly in real time, overcomes interference from polyphones and synonyms in the text, removes noise words while preserving sentence meaning, and provides efficient and accurate noise word filtering for scenarios in which spoken language is converted into text.
Description
Technical Field
The invention relates to the technical field of knowledge graphs, and in particular to an efficient and accurate method for processing noise words in text based on a knowledge graph.
Background
A knowledge graph, known in library and information science as knowledge domain visualization or knowledge domain mapping, is a family of graphs that display the development process and structural relationships of knowledge. It uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the interconnections among knowledge items. By combining the theories and methods of applied mathematics, graphics, information visualization and information science with methods such as bibliometric citation analysis and co-occurrence analysis, a knowledge graph visually presents the core structure, development history, frontier fields and overall knowledge framework of a discipline, serving multidisciplinary integration and providing a practical and valuable reference for research in that discipline.
However, existing noise word processing methods rely on sensitive-word filtering: a complete sensitive-word lexicon is matched against the words of the text, and any sensitive word found is removed. Such methods attend only to sensitive words. In spoken language, many modal particles are also noise words, yet these methods provide no noise word filtering for text converted from speech, and during filtering homophones are easily deleted by mistake or missed, which distorts the meaning of the sentence.
Disclosure of Invention
The invention provides an efficient and accurate method for processing noise words in text based on a knowledge graph, which can effectively solve the problems described in the background.
To achieve this purpose, the invention provides the following technical solution: an efficient and accurate method for processing noise words in text based on a knowledge graph, comprising the following steps:
S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words;
S2, assigning a weight to each word;
S3, segmenting the text with a word segmentation tool;
S4, first correcting homophones in the text into business words through the business knowledge graph, and recording all business words that appear in the text;
S5, matching the corrected text against the filter words, while keeping the recorded business words unaffected by filtering;
S6, outputting the filtered text.
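The six steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `NoiseFilter`, its fields, and the English sample tokens are hypothetical, the knowledge graph is reduced to a plain homophone-to-business-word mapping, and whitespace splitting stands in for a real segmentation tool.

```python
from dataclasses import dataclass, field


@dataclass
class NoiseFilter:
    """Hypothetical container for the resources built in S1 and S2."""
    business_words: set[str]        # S1: business (service) words from the graph
    filter_words: set[str]          # S1: lexicon of words to be filtered
    homophones: dict[str, str]      # S1: homophone variant -> business word
    weights: dict[str, int] = field(default_factory=dict)  # S2: per-word weights

    def segment(self, text: str) -> list[str]:
        # S3: stand-in for a real segmentation tool; splits on whitespace only.
        return text.split()

    def correct(self, tokens: list[str]) -> tuple[list[str], set[str]]:
        # S4: replace homophone variants with business words, then record
        # every business word that appears in the corrected text.
        corrected = [self.homophones.get(t, t) for t in tokens]
        recorded = {t for t in corrected if t in self.business_words}
        return corrected, recorded

    def apply_filter(self, tokens: list[str], recorded: set[str]) -> list[str]:
        # S5: drop filter words, but never drop a recorded business word.
        return [t for t in tokens if t in recorded or t not in self.filter_words]

    def run(self, text: str) -> str:
        # S6: output the filtered text.
        tokens, recorded = self.correct(self.segment(text))
        return " ".join(self.apply_filter(tokens, recorded))
```

With the hypothetical data `NoiseFilter(business_words={"coat"}, filter_words={"um", "coat"}, homophones={"cote": "coat"})`, the input `"um I like the warm cote"` has "cote" corrected to "coat", "um" removed as noise, and "coat" retained even though it also appears in the filter lexicon.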
According to the above technical solution, the filter lexicon in S1 is connected to network data and is classified into categories including political, pornographic and violent words, near-synonyms, and obscure words, and a cross-linked semantic network is established over the lexicon.
According to the above technical solution, the knowledge graph in S1 records the relationships among various labels, concepts and entities;
business-related words are added to the knowledge graph and cross-referenced with the filter lexicon, overlapping words are marked, and the relationship between each such word and the text that follows it is analyzed to judge whether a new word meaning arises, thereby avoiding word errors.
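The cross-referencing of business words with the filter lexicon might look like the sketch below. The function names `mark_overlaps` and `forms_compound`, the label strings, and the idea of checking a word plus its successor against a set of known compounds are illustrative assumptions; the source does not specify how the overlap or the following-context analysis is performed.

```python
def mark_overlaps(business_words: set[str], filter_words: set[str]) -> dict[str, str]:
    """Label every word as 'business', 'filter', or 'overlap'."""
    labels = {}
    for word in business_words | filter_words:
        if word in business_words and word in filter_words:
            labels[word] = "overlap"   # present in both: needs special care
        elif word in business_words:
            labels[word] = "business"
        else:
            labels[word] = "filter"
    return labels


def forms_compound(tokens: list[str], index: int, compounds: set[str]) -> bool:
    """Return True if tokens[index] joined with the next token is a known compound."""
    if index + 1 >= len(tokens):
        return False
    return tokens[index] + tokens[index + 1] in compounds
```

An overlapping word that forms a known compound with its successor would then be treated as carrying a new meaning rather than being filtered on its own.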
According to the above technical solution, the weights in S2 are determined by one or a combination of a scoring method, a statistical method, a sequence synthesis method, a formula method, a mathematical statistics method, an analytic hierarchy process and a complexity analysis method.
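The source names the weighting methods but gives no formula. As one possible reading, the sketch below blends a scoring method (manually assigned scores) with a statistical method (relative corpus frequency) using a linear mix. The function `combine_weights`, the parameter `alpha`, and the normalization are assumptions made purely for illustration.

```python
from collections import Counter


def combine_weights(
    manual_scores: dict[str, float],
    corpus_tokens: list[str],
    alpha: float = 0.5,
) -> dict[str, float]:
    """Blend normalized manual scores with relative corpus frequencies."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values()) or 1                     # avoid division by zero
    max_score = max(manual_scores.values(), default=0) or 1
    words = set(manual_scores) | set(counts)
    return {
        w: alpha * manual_scores.get(w, 0) / max_score
        + (1 - alpha) * counts[w] / total                 # Counter yields 0 if absent
        for w in words
    }
```

Setting `alpha` to 1 uses the manual scores alone, and setting it to 0 uses corpus statistics alone.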
According to the above technical solution, the segmentation tool in S3 is a tool that splits text into word segments according to statistical grammar rules or a custom dictionary, so that the text is divided into word combinations of 1 to 5 characters.
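A dictionary-based segmenter of this kind can be sketched with forward maximum matching, a common technique that the source does not itself name. The sketch assumes the 1-5 limit refers to token length in characters; the function name, the `max_len` parameter, and the sample sentence and dictionary are hypothetical and not taken from the patent.

```python
def forward_max_match(text: str, dictionary: set[str], max_len: int = 5) -> list[str]:
    """Greedily take the longest dictionary word at each position (1..max_len chars)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical sample: "我喜欢大衣" ("I like overcoats").
print(forward_max_match("我喜欢大衣", {"喜欢", "大衣"}))
```

Greedy matching is simple and fast, but it cannot resolve overlapping candidates on its own, which is where per-word weights become useful.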
According to the above technical solution, in S4 the words of the text are compared against the knowledge graph and then against the filter words, the filter words are removed, and some segmented words are set apart and annotated.
According to the above technical solution, the business words recorded in S5 remain stored; they are then brought into the filter lexicon and screened against it before the filter words are matched against the text, thereby reducing the error rate.
According to the above technical solution, after the filtered text is output in S6, it is reviewed and annotated manually, words are labeled and color-coded, and the text is finalized once approved by a reviewer.
Compared with the prior art, the invention has the following beneficial effects: the method filters text data accurately and stably, can be extended and modified flexibly in real time, overcomes interference from polyphones and synonyms in the text, preserves sentence meaning while filtering noise words, and provides efficient and accurate noise word filtering for scenarios in which spoken language is converted into text.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
In the drawings:
Fig. 1 is a schematic structural diagram of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1:
As shown in Fig. 1, the invention provides a technical solution: an efficient and accurate method for processing noise words in text based on a knowledge graph, comprising the following steps:
S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words;
S2, assigning a weight to each word;
S3, segmenting the text with a word segmentation tool;
S4, first correcting homophones in the text into business words through the business knowledge graph, and recording all business words that appear in the text;
S5, matching the corrected text against the filter words, while keeping the recorded business words unaffected by filtering;
S6, outputting the filtered text.
According to the above technical solution, the filter lexicon in S1 is connected to network data and is classified into categories including political, pornographic and violent words, near-synonyms, and obscure words, and a cross-linked semantic network is established over the lexicon.
According to the above technical solution, the knowledge graph in S1 records the relationships among various labels, concepts and entities;
business-related words are added to the knowledge graph and cross-referenced with the filter lexicon, overlapping words are marked, and the relationship between each such word and the text that follows it is analyzed to judge whether a new word meaning arises, thereby avoiding word errors.
According to the above technical solution, the weights in S2 are determined by one or a combination of a scoring method, a statistical method, a sequence synthesis method, a formula method, a mathematical statistics method, an analytic hierarchy process and a complexity analysis method.
According to the above technical solution, the segmentation tool in S3 is a tool that splits text into word segments according to statistical grammar rules or a custom dictionary, so that the text is divided into word combinations of 1 to 5 characters.
According to the above technical solution, in S4 the words of the text are compared against the knowledge graph and then against the filter words, the filter words are removed, and some segmented words are set apart and annotated.
According to the above technical solution, the business words recorded in S5 remain stored; they are then brought into the filter lexicon and screened against it before the filter words are matched against the text, thereby reducing the error rate.
According to the above technical solution, after the filtered text is output in S6, it is reviewed and annotated manually, words are labeled and color-coded, and the text is finalized once approved by a reviewer.
Example 2:
The invention provides a technical solution: an efficient and accurate method for processing noise words in text based on a knowledge graph, comprising the following steps:
S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words, wherein the filter lexicon is preferably classified to improve reusability;
S2, assigning a weight to each word;
S3, segmenting the text with a word segmentation tool, the word weights being used to adjust the segmentation result:
Word weights: ["collect": 200, "concentrate": 100]
Text: "like to collect the products of independent innovation of China"
Segmentation result: "like-collect-China-independent-innovation-product";
S4, first correcting homophones in the text into business words through the business knowledge graph, and recording all business words that appear in the text:
Text: "kaeha, I like the warmth of the Haerbin autumn-Dai"
Result: "kaeha, I like the warmth of the Harbin autumn overcoat" [Harbin] [overcoat];
S5, matching the corrected text against the filter words, the recorded business words being unaffected by filtering:
Input: "kaeha, I like the warmth of the Harbin autumn overcoat" [Harbin] [overcoat]
After noise filtering: "I like the warmth of the Harbin autumn overcoat";
S6, outputting the filtered text.
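The role of word weights in S3 can be illustrated with a small dynamic-programming segmenter that picks, among all dictionary-consistent splits, the one with the highest total weight. This is a hedged sketch rather than the tool used in the example: the function `best_segmentation`, the default weight of 1 for single characters, and the letter strings standing in for characters are all assumptions.

```python
def best_segmentation(text: str, weights: dict[str, int], max_len: int = 5) -> list[str]:
    """Return the split of `text` that maximizes the sum of word weights."""
    n = len(text)
    # best[i] holds (score, tokens) for the best segmentation of text[:i].
    best = [(0, [])] + [(-1, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Multi-character pieces must be known words; single characters
            # are always allowed so that every position stays reachable.
            if len(piece) > 1 and piece not in weights:
                continue
            score = best[start][0] + weights.get(piece, 1)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n][1]
```

For the stand-in text `"abcd"`, the overlapping candidates "ab" and "bc" compete for the character "b". With weights `{"ab": 100, "bc": 200, "cd": 100}` the higher-weighted "bc" wins, and raising the weight of "ab" to 200 while lowering "bc" to 100 flips the result to "ab" followed by "cd".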
Compared with the prior art, the invention has the following beneficial effects: the method filters text data accurately and stably, can be extended and modified flexibly in real time, overcomes interference from polyphones and synonyms in the text, preserves sentence meaning while filtering noise words, and provides efficient and accurate noise word filtering for scenarios in which spoken language is converted into text.
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the embodiments or replace some of their features with equivalents without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.
Claims (8)
1. An efficient and accurate method for processing noise words in text based on a knowledge graph, characterized by comprising the following steps:
S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the various homophones of business-related words;
S2, assigning a weight to each word;
S3, segmenting the text with a word segmentation tool;
S4, first correcting homophones in the text into business words through the business knowledge graph, and recording all business words that appear in the text;
S5, matching the corrected text against the filter words, while keeping the recorded business words unaffected by filtering;
S6, outputting the filtered text.
2. The method according to claim 1, wherein the filter lexicon in S1 is connected to network data and is classified into categories including political, pornographic and violent words, near-synonyms, and obscure words, and a cross-linked semantic network is established over the lexicon.
3. The method according to claim 1, wherein the knowledge graph in S1 records the relationships among various labels, concepts and entities;
business-related words are added to the knowledge graph and cross-referenced with the filter lexicon, overlapping words are marked, and the relationship between each such word and the text that follows it is analyzed to judge whether a new word meaning arises, thereby avoiding word errors.
4. The method according to claim 1, wherein the weights in S2 are determined by one or a combination of a scoring method, a statistical method, a sequence synthesis method, a formula method, a mathematical statistics method, an analytic hierarchy process and a complexity analysis method.
5. The method according to claim 1, wherein the segmentation tool in S3 is a tool that splits text into word segments according to statistical grammar rules or a custom dictionary, so that the text is divided into word combinations of 1 to 5 characters.
6. The method according to claim 1, wherein in S4 the words of the text are compared against the knowledge graph and then against the filter words, the filter words are removed, and some segmented words are set apart and annotated.
7. The method according to claim 1, wherein the business words recorded in S5 remain stored, are then brought into the filter lexicon and screened against it before the filter words are matched against the text, thereby reducing the error rate.
8. The method according to claim 1, wherein after the filtered text is output in S6, it is reviewed and annotated manually, words are labeled and color-coded, and the text is finalized once approved by a reviewer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011422655.XA CN112487798A (en) | 2020-12-08 | 2020-12-08 | Text efficient and accurate noise word processing method based on knowledge graph |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112487798A (en) | 2021-03-12 |
Family
ID=74940713
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011422655.XA (Pending), CN112487798A (en) | Text efficient and accurate noise word processing method based on knowledge graph | 2020-12-08 | 2020-12-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112487798A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050071152A1 (en) * | 2003-09-29 | 2005-03-31 | Hitachi, Ltd. | Cross lingual text classification apparatus and method |
CN106055541A (en) * | 2016-06-29 | 2016-10-26 | 清华大学 | News content sensitive word filtering method and system |
CN109146610A (en) * | 2018-07-16 | 2019-01-04 | 众安在线财产保险股份有限公司 | It is a kind of intelligently to insure recommended method, device and intelligence insurance robot device |
CN110176237A (en) * | 2019-07-09 | 2019-08-27 | 北京金山数字娱乐科技有限公司 | A kind of audio recognition method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR100912502B1 (en) | Machine translation method for PDF file | |
CN107392143B (en) | Resume accurate analysis method based on SVM text classification | |
CN110046261B (en) | Construction method of multi-modal bilingual parallel corpus of construction engineering | |
CN105975625A (en) | Chinglish inquiring correcting method and system oriented to English search engine | |
CN110609983B (en) | Structured decomposition method for policy file | |
CN114065738B (en) | Chinese spelling error correction method based on multitask learning | |
CN105893414A (en) | Method and apparatus for screening valid term of a pronunciation lexicon | |
CN108681529B (en) | Multi-language text and voice generation method of flow model diagram | |
CN109740159B (en) | Processing method and device for named entity recognition | |
CN106294326B (en) | A kind of news report Sentiment orientation analysis method | |
CN106294466A (en) | Disaggregated model construction method, disaggregated model build equipment and sorting technique | |
CN107818082B (en) | Semantic role recognition method combined with phrase structure tree | |
JP6061337B2 (en) | Rule generation device and extraction device | |
CN107145584A (en) | A kind of resume analytic method based on n gram models | |
CN106096664A (en) | A kind of sentiment analysis method based on social network data | |
CN107392433A (en) | A kind of method and apparatus for extracting enterprise's incidence relation information | |
CN104778256A (en) | Rapid incremental clustering method for domain question-answering system consultations | |
CN110175585A (en) | It is a kind of letter answer correct system and method automatically | |
CN112001178A (en) | Long-tail entity identification and disambiguation method | |
CN102955775A (en) | Automatic foreign name identification and control method based on context semantics | |
CN109783819A (en) | Regular expression generation method and system | |
CN110175337B (en) | Text display method and device | |
CN109902299A (en) | A kind of text handling method and device | |
CN112487798A (en) | Text efficient and accurate noise word processing method based on knowledge graph | |
CN117542353B (en) | Voice understanding method based on knowledge graph and voice feature fusion network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||