CN112487798A - Text efficient and accurate noise word processing method based on knowledge graph - Google Patents

Info

Publication number
CN112487798A
Authority
CN
China
Prior art keywords
text
words
word
knowledge
filtering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011422655.XA
Other languages
Chinese (zh)
Inventor
李抒雁
沙涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shixiang Culture Communication Co ltd
Original Assignee
Shanghai Shixiang Culture Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shixiang Culture Communication Co ltd filed Critical Shanghai Shixiang Culture Communication Co ltd
Priority to CN202011422655.XA priority Critical patent/CN112487798A/en
Publication of CN112487798A publication Critical patent/CN112487798A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a knowledge-graph-based method for efficiently and accurately processing noise words in text, comprising the following steps: S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the homophones of business-related words to it; S2, assigning a weight to each word; S3, segmenting the text with a word segmentation tool; S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text; S5, matching the corrected text against the filter words while keeping the recorded business words unaffected by filtering; S6, outputting the filtered text. The method filters text data accurately and stably, can be extended and modified flexibly in real time, overcomes interference from polyphonic words and synonyms in the text, preserves sentence meaning while filtering noise words, and provides efficient and accurate noise word filtering for speech-to-text scenarios.

Description

Text efficient and accurate noise word processing method based on knowledge graph
Technical Field
The invention relates to the technical field of knowledge graphs, and in particular to a knowledge-graph-based method for efficiently and accurately processing noise words in text.
Background
A knowledge graph, known in library and information science as knowledge domain visualization or knowledge domain mapping, is a family of graphs that display the development and structure of knowledge. It uses visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the links between pieces of knowledge. It combines theories and methods from applied mathematics, graphics, information visualization and information science with bibliometric citation analysis and co-occurrence analysis, and uses visual graphs to present the core structure, development history, frontier areas and overall knowledge framework of a discipline, thereby achieving multidisciplinary integration. A knowledge graph can provide practical and valuable references for research in a discipline.
However, existing noise word processing methods are sensitive-word filters: a complete sensitive-word lexicon is matched against the words of the text, and any sensitive word found is removed from the text. Such methods attend only to sensitive words. In spoken language, many modal particles are also noise words, yet existing methods offer no noise word filtering for text converted from speech. In addition, homophones are easily deleted by mistake or missed during filtering, which affects sentence meaning.
Disclosure of Invention
The invention provides a knowledge-graph-based method for efficiently and accurately processing noise words in text, which can effectively solve the problems described in the background.
To achieve this purpose, the invention provides the following technical scheme: a knowledge-graph-based method for efficiently and accurately processing noise words in text, comprising the following steps:
S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the homophones of business-related words to it;
S2, assigning a weight to each word;
S3, segmenting the text with a word segmentation tool;
S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text;
S5, matching the corrected text against the filter words while keeping the recorded business words unaffected by filtering;
S6, outputting the filtered text.
According to the above technical scheme, the filter-word lexicon in S1 is connected to network data and is classified into categories, including political, pornographic, violent, near-synonym and obscure words, and a cross-semantic interwoven network is established for the lexicon.
According to the above technical scheme, the knowledge graph in S1 records the relationships among labels, concepts and entities;
business-related words are added to the knowledge graph and cross-referenced with the filter-word lexicon, overlapping words are marked, and the relationship between each marked word and the words that follow it is analyzed to judge whether a new word meaning arises, thereby avoiding word errors.
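The cross-referencing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `BusinessGraph` class, its fields, the `mark_overlaps` helper and all sample words are hypothetical names and data chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessGraph:
    """Toy stand-in for the business knowledge graph (structure is assumed)."""
    # canonical business word -> homophone variants seen in transcripts
    homophones: dict = field(default_factory=dict)
    # (subject, relation, object) triples linking labels, concepts and entities
    triples: set = field(default_factory=set)

    def add_word(self, word, variants=()):
        self.homophones.setdefault(word, set()).update(variants)

    def vocabulary(self):
        return set(self.homophones)


def mark_overlaps(graph, filter_lexicon):
    """Return business words that also occur in the filter lexicon,
    mapped to the sorted list of categories they collide with."""
    overlaps = {}
    for category, words in filter_lexicon.items():
        for word in words & graph.vocabulary():
            overlaps.setdefault(word, []).append(category)
    return {word: sorted(cats) for word, cats in overlaps.items()}


graph = BusinessGraph()
graph.add_word("well", variants=("wel",))        # hypothetical business term
graph.triples.add(("well", "is_a", "equipment"))  # hypothetical relation
filter_lexicon = {"filler": {"um", "well"}, "violent": {"kill"}}

overlaps = mark_overlaps(graph, filter_lexicon)
print(overlaps)  # words needing context analysis before they may be filtered
```

Marked words are the ones that need the follow-up context check, since they are legitimate business vocabulary in some sentences and noise in others.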
According to the above technical scheme, the weight in S2 is determined using one or more of a scoring method, a statistical method, a sequence synthesis method, a formula method, a mathematical statistics method, the analytic hierarchy process and a complexity analysis method.
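The text does not say how any of these methods is applied. As one possible reading of the statistical method, the sketch below derives weights from word counts in a reference corpus; the `frequency_weights` function, the `base` scale factor and the sample corpus are assumptions made for illustration.

```python
from collections import Counter


def frequency_weights(corpus_tokens, base=100):
    """Assign each word a weight proportional to its corpus frequency.

    `base` is an arbitrary scale factor (assumed); a real system might
    normalize or combine several scoring methods instead.
    """
    counts = Counter(corpus_tokens)
    return {word: base * n for word, n in counts.items()}


# Hypothetical corpus in which "collect" is seen twice as often.
weights = frequency_weights(["collect", "collect", "concentrate"])
print(weights)
```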
According to the above technical scheme, the word segmentation tool in S3 segments the text into words according to grammatical and statistical rules or a custom dictionary, so that the text is divided into words of 1 to 5 characters.
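A weighted custom dictionary can steer segmentation when two candidate words overlap. The sketch below is an assumed dynamic-programming segmenter, not the tool the patent uses: it picks the split with the highest total weight and caps word length at 5 characters. An unspaced English string stands in for Chinese text, and the `segment` function and its weights are hypothetical.

```python
def segment(text, weights, max_len=5, unknown_weight=0):
    """Split `text` into words of at most `max_len` characters,
    maximizing the sum of dictionary weights."""
    n = len(text)
    # best[i] holds (score, words) for the best segmentation of text[:i]
    best = [(0, [])] + [None] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in weights:
                gain = weights[piece]
            elif len(piece) == 1:
                gain = unknown_weight  # unknown single characters are always allowed
            else:
                continue  # multi-character pieces must be dictionary words
            score = best[start][0] + gain
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n][1]


# "nowhere" can be read as "now|here" or "no|where"; the weights decide.
print(segment("nowhere", {"now": 200, "here": 50, "no": 100, "where": 100}))
print(segment("nowhere", {"now": 100, "here": 50, "no": 200, "where": 100}))
```

Raising the weight of one candidate flips the split, which is the same adjustment Example 2 makes by giving "collect" a weight of 200 against 100 for "concentrate".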
According to the above technical scheme, in S4 the words of the knowledge graph are compared with the text and with the filter words, the filter words are removed, and certain segmented words are separated out and annotated.
According to the above technical scheme, the business words recorded in S5 are retained; they are then brought into the filter-word lexicon and screened against it, and the filter words are matched against the text, which reduces the error rate.
According to the above technical scheme, after the filtered text is output in S6, it is reviewed and annotated manually, with words labeled and color-coded, and is finalized once a reviewer approves it.
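One way to hand the output to a reviewer is to keep the removed words visible instead of dropping them silently. The sketch below is an assumption about how such annotation could look in plain text: the `review_markup` function and the `[-word-]` marker are hypothetical, and a real interface would presumably use color as the text describes.

```python
def review_markup(tokens, removed_indices):
    """Render the token sequence with every filtered-out token wrapped
    in [-...-] so a reviewer can check each deletion."""
    return " ".join(
        f"[-{tok}-]" if i in removed_indices else tok
        for i, tok in enumerate(tokens)
    )


marked = review_markup(["um", "I", "like", "it"], removed_indices={0})
print(marked)
```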
Compared with the prior art, the invention has the following beneficial effects: the method filters text data accurately and stably, can be extended and modified flexibly in real time, overcomes interference from polyphonic words and synonyms in the text, preserves sentence meaning while filtering noise words, and provides efficient and accurate noise word filtering for speech-to-text scenarios.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
In the drawings:
Fig. 1 is a schematic structural diagram of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1:
As shown in Fig. 1, the invention provides a knowledge-graph-based method for efficiently and accurately processing noise words in text, comprising the following steps:
S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the homophones of business-related words to it;
S2, assigning a weight to each word;
S3, segmenting the text with a word segmentation tool;
S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text;
S5, matching the corrected text against the filter words while keeping the recorded business words unaffected by filtering;
S6, outputting the filtered text.
According to the above technical scheme, the filter-word lexicon in S1 is connected to network data and is classified into categories, including political, pornographic, violent, near-synonym and obscure words, and a cross-semantic interwoven network is established for the lexicon.
According to the above technical scheme, the knowledge graph in S1 records the relationships among labels, concepts and entities;
business-related words are added to the knowledge graph and cross-referenced with the filter-word lexicon, overlapping words are marked, and the relationship between each marked word and the words that follow it is analyzed to judge whether a new word meaning arises, thereby avoiding word errors.
According to the above technical scheme, the weight in S2 is determined using one or more of a scoring method, a statistical method, a sequence synthesis method, a formula method, a mathematical statistics method, the analytic hierarchy process and a complexity analysis method.
According to the above technical scheme, the word segmentation tool in S3 segments the text into words according to grammatical and statistical rules or a custom dictionary, so that the text is divided into words of 1 to 5 characters.
According to the above technical scheme, in S4 the words of the knowledge graph are compared with the text and with the filter words, the filter words are removed, and certain segmented words are separated out and annotated.
According to the above technical scheme, the business words recorded in S5 are retained; they are then brought into the filter-word lexicon and screened against it, and the filter words are matched against the text, which reduces the error rate.
According to the above technical scheme, after the filtered text is output in S6, it is reviewed and annotated manually, with words labeled and color-coded, and is finalized once a reviewer approves it.
Example 2:
The invention provides a knowledge-graph-based method for efficiently and accurately processing noise words in text, comprising the following steps:
S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the homophones of business-related words to it, wherein the filter-word lexicon is preferably classified to improve reusability;
S2, assigning a weight to each word;
S3, segmenting the text with a word segmentation tool, using the word weights to adjust the segmentation result:
Word weights: ["collect": 200, "concentrate": 100]
Text: "like to collect the products of independent innovation of China"
Segmentation result: "like-collect-China-independent-innovation-product";
S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text:
Text: "Kaha, I like the warmth of the Haerbin autumn overcoat"
Result: "Kaha, I like the warmth of the Harbin autumn overcoat" [Harbin] [overcoat];
S5, matching the corrected text against the filter words, the recorded business words being unaffected by filtering:
Input: "Kaha, I like the warmth of the Harbin autumn overcoat" [Harbin] [overcoat]
After noise filtering: "I like the warmth of the Harbin autumn overcoat";
S6, outputting the filtered text.
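Steps S4 to S6 of this example can be sketched as below. This is a minimal illustration under stated assumptions: the functions `correct_and_record` and `filter_noise` are hypothetical, the text is assumed to be already segmented into tokens, and the homophone table and filter words are invented so that the example sentence above comes out as shown. In particular, treating "Haerbin" as the homophone variant of "Harbin" is a guess at what the translated example intends.

```python
def correct_and_record(tokens, homophone_to_word, business_words):
    """S4: map homophone variants to business words and record, in order
    of first appearance, every business word found in the text."""
    corrected, recorded = [], []
    for tok in tokens:
        word = homophone_to_word.get(tok, tok)
        corrected.append(word)
        if word in business_words and word not in recorded:
            recorded.append(word)
    return corrected, recorded


def filter_noise(tokens, filter_words, protected):
    """S5: drop filter words unless they were recorded as business words."""
    return [t for t in tokens if t in protected or t not in filter_words]


tokens = ["kaha", "I", "like", "the", "warmth", "of", "the",
          "Haerbin", "autumn", "overcoat"]
homophone_to_word = {"Haerbin": "Harbin"}   # hypothetical homophone entry
business_words = {"Harbin", "overcoat"}
# "overcoat" is placed in the filter list only to show that protection works.
filter_words = {"kaha", "overcoat"}

corrected, recorded = correct_and_record(tokens, homophone_to_word, business_words)
cleaned = filter_noise(corrected, filter_words, protected=set(recorded))
output = " ".join(cleaned)                  # S6: output the filtered text
print(recorded)
print(output)
```

Recording business words before filtering is what keeps "overcoat" in the output even though it also appears in the filter list.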
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A knowledge-graph-based method for efficiently and accurately processing noise words in text, characterized by comprising the following steps:
S1, building a lexicon of words to be filtered, building a business knowledge graph, and adding the homophones of business-related words to it;
S2, assigning a weight to each word;
S3, segmenting the text with a word segmentation tool;
S4, correcting homophones in the text to business words by means of the business knowledge graph, and recording all business words that appear in the text;
S5, matching the corrected text against the filter words while keeping the recorded business words unaffected by filtering;
S6, outputting the filtered text.
2. The method according to claim 1, wherein the filter-word lexicon in S1 is connected to network data and is classified into categories, including political, pornographic, violent, near-synonym and obscure words, and a cross-semantic interwoven network is established for the lexicon.
3. The method according to claim 1, wherein the knowledge graph in S1 records the relationships among labels, concepts and entities;
business-related words are added to the knowledge graph and cross-referenced with the filter-word lexicon, overlapping words are marked, and the relationship between each marked word and the words that follow it is analyzed to judge whether a new word meaning arises, thereby avoiding word errors.
4. The method according to claim 1, wherein the weight in S2 is obtained by one or more of a scoring method, a statistical method, a sequence synthesis method, a formula method, a mathematical statistics method, the analytic hierarchy process and a complexity analysis method.
5. The method according to claim 1, wherein the word segmentation tool in S3 segments the text into words according to grammatical and statistical rules or a custom dictionary, so that the text is divided into words of 1 to 5 characters.
6. The method according to claim 1, wherein in S4 the words of the knowledge graph are compared with the text and with the filter words, the filter words are removed, and certain segmented words are separated out and annotated.
7. The method according to claim 1, wherein the business words recorded in S5 are retained, are then brought into the filter-word lexicon and screened against it, and the filter words are matched against the text, which reduces the error rate.
8. The method according to claim 1, wherein after the filtered text is output in S6, it is reviewed and annotated manually, with words labeled and color-coded, and is finalized once a reviewer approves it.
CN202011422655.XA 2020-12-08 2020-12-08 Text efficient and accurate noise word processing method based on knowledge graph Pending CN112487798A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011422655.XA CN112487798A (en) 2020-12-08 2020-12-08 Text efficient and accurate noise word processing method based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011422655.XA CN112487798A (en) 2020-12-08 2020-12-08 Text efficient and accurate noise word processing method based on knowledge graph

Publications (1)

Publication Number Publication Date
CN112487798A (en) 2021-03-12

Family

ID=74940713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011422655.XA Pending CN112487798A (en) 2020-12-08 2020-12-08 Text efficient and accurate noise word processing method based on knowledge graph

Country Status (1)

Country Link
CN (1) CN112487798A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071152A1 (en) * 2003-09-29 2005-03-31 Hitachi, Ltd. Cross lingual text classification apparatus and method
CN106055541A (en) * 2016-06-29 2016-10-26 清华大学 News content sensitive word filtering method and system
CN109146610A (en) * 2018-07-16 2019-01-04 众安在线财产保险股份有限公司 It is a kind of intelligently to insure recommended method, device and intelligence insurance robot device
CN110176237A (en) * 2019-07-09 2019-08-27 北京金山数字娱乐科技有限公司 A kind of audio recognition method and device

Similar Documents

Publication Publication Date Title
KR100912502B1 (en) Machine translation method for PDF file
CN107392143B (en) Resume accurate analysis method based on SVM text classification
CN110046261B (en) Construction method of multi-modal bilingual parallel corpus of construction engineering
CN105975625A (en) Chinglish inquiring correcting method and system oriented to English search engine
CN110609983B (en) Structured decomposition method for policy file
CN114065738B (en) Chinese spelling error correction method based on multitask learning
CN105893414A (en) Method and apparatus for screening valid term of a pronunciation lexicon
CN108681529B (en) Multi-language text and voice generation method of flow model diagram
CN109740159B (en) Processing method and device for named entity recognition
CN106294326B (en) A kind of news report Sentiment orientation analysis method
CN106294466A (en) Disaggregated model construction method, disaggregated model build equipment and sorting technique
CN107818082B (en) Semantic role recognition method combined with phrase structure tree
JP6061337B2 (en) Rule generation device and extraction device
CN107145584A (en) A kind of resume analytic method based on n gram models
CN106096664A (en) A kind of sentiment analysis method based on social network data
CN107392433A (en) A kind of method and apparatus for extracting enterprise's incidence relation information
CN104778256A (en) Rapid incremental clustering method for domain question-answering system consultations
CN110175585A (en) It is a kind of letter answer correct system and method automatically
CN112001178A (en) Long-tail entity identification and disambiguation method
CN102955775A (en) Automatic foreign name identification and control method based on context semantics
CN109783819A (en) Regular expression generation method and system
CN110175337B (en) Text display method and device
CN109902299A (en) A kind of text handling method and device
CN112487798A (en) Text efficient and accurate noise word processing method based on knowledge graph
CN117542353B (en) Voice understanding method based on knowledge graph and voice feature fusion network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination