CN107480128A - The segmenting method and device of Chinese text - Google Patents
The segmenting method and device of Chinese text
- Publication number
- CN107480128A
- Authority
- CN
- China
- Prior art keywords
- text
- segmented
- dictionary
- word
- scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Character Discrimination (AREA)
Abstract
An embodiment of the invention discloses a segmenting method and device for Chinese text. Text to be segmented is received and, after being matched against a standard dictionary, is initially segmented. After the initial segmentation, a CRF model identifies the scene of the text to be segmented. Ambiguity identification is then performed on the entries of the text according to that scene, yielding the word segmentation result of the text. This effectively addresses the problem that existing dictionaries do not perform ambiguity processing based on different scenes: easily confused meanings of the same vocabulary segment are deeply recognized, and accuracy is high.
Description
Technical field
The present invention relates to the computer field, and in particular to a segmenting method and device for Chinese text.
Background
Chinese word segmentation refers to cutting a sequence of Chinese characters into individual words. It plays an important role in fields such as information retrieval, machine translation and speech recognition, and is an essential step in Chinese language processing. In general, because of segmentation ambiguity, traditional dictionary-based mechanical segmentation methods cannot reach 100% accuracy. For example, "Nanjing Yangtze River Bridge" (南京市长江大桥) can be segmented as "Nanjing City / Yangtze River Bridge" (南京市 / 长江大桥), but it can also be segmented as "Nanjing / mayor / Jiang Daqiao" (南京 / 市长 / 江大桥). Without relying on other knowledge, both segmentations appear reasonable.
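The ambiguity described above can be reproduced with a short sketch. It enumerates every split of a string whose pieces all occur in a vocabulary. The string 南京市长江大桥 is the usual Chinese form of the Nanjing bridge example, and the toy vocabulary `VOCAB` and the function name `all_segmentations` are assumptions made for illustration, not part of the patent.

```python
def all_segmentations(text, vocab):
    """Return every way to split `text` into words that all appear in `vocab`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Keep this head and recurse on the remainder of the string.
            for tail in all_segmentations(text[end:], vocab):
                results.append([head] + tail)
    return results


# Toy vocabulary, assumed for illustration only.
VOCAB = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥", "江大桥"}

for seg in all_segmentations("南京市长江大桥", VOCAB):
    print(" / ".join(seg))
```

With this vocabulary the dictionary alone admits several readings, which is why a purely dictionary-based method needs additional knowledge to choose among them.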
The dictionaries on which existing segmentation techniques rely are two-dimensional, carrying at most a part of speech and a simple probability weight. At the level of the algorithm model, ambiguity is not handled according to different scenes. In addition, abbreviation entries in domestic dictionaries are currently added by hand, which is a rather mechanical process.
Summary of the invention
The purpose of the embodiments of the invention is to provide a segmenting method and device for Chinese text that effectively address the problem that existing dictionaries do not perform ambiguity processing based on different scenes, with high accuracy.
To achieve the above object, an embodiment of the invention provides a segmenting method for Chinese text, comprising the steps of:
Receiving text to be segmented, and initially segmenting the text after matching it against a standard dictionary;
After the initial segmentation of the text to be segmented, identifying the scene of the text by means of a CRF model;
Performing ambiguity identification on the entries of the text to be segmented according to the identified scene, thereby obtaining the word segmentation result of the text.
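The first step above, dictionary matching followed by initial segmentation, can be sketched as follows. The source does not say which matching strategy is used; forward maximum matching is assumed here because it is a common dictionary-based choice. The function name `forward_max_match`, the `max_len` parameter and the toy `LEXICON` are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a lexicon entry matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, assumed for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", LEXICON))
```

On this input the greedy match commits to 研究生 first and leaves 命 stranded, although 研究 / 生命 / 起源 is the intended reading. Errors of this kind are what the later scene recognition and ambiguity identification steps are meant to correct.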
Compared with the prior art, the segmenting method for Chinese text disclosed by the invention receives text to be segmented, initially segments it after matching against a standard dictionary, identifies the scene of the text by means of a CRF model once the initial segmentation is done, and then performs ambiguity identification on the entries of the text according to that scene to obtain the word segmentation result. This effectively addresses the problem that existing dictionaries do not perform ambiguity processing based on different scenes; easily confused meanings of the same vocabulary segment are deeply recognized, and accuracy is high.
As an improvement of the above scheme, the method further comprises the step of:
Simplifying the long words in the word segmentation result with a trained abbreviation model to obtain the corresponding abbreviations, and performing matching verification between the abbreviations and the standard dictionary.
As an improvement of the above scheme, the standard dictionary is a multi-semantic network lexicon, obtained by the following step:
Collecting general dictionaries, entity dictionaries and domain corpora, and merging them to generate the multi-semantic network lexicon.
An embodiment of the invention further provides a segmenting device for Chinese text, comprising:
An initial word segmentation module, configured to receive text to be segmented and to initially segment the text after matching it against a standard dictionary;
A scene recognition module, configured to identify the scene of the text to be segmented by means of a CRF model after the initial segmentation;
An ambiguity identification module, configured to perform ambiguity identification on the entries of the text to be segmented according to the identified scene, thereby obtaining the word segmentation result of the text.
Compared with the prior art, the segmenting device for Chinese text disclosed by the invention receives text to be segmented through the initial word segmentation module and initially segments it after matching against a standard dictionary. After the initial segmentation, the scene recognition module identifies the scene of the text by means of a CRF model. The ambiguity identification module then performs ambiguity identification on the entries of the text according to that scene to obtain the word segmentation result. This effectively addresses the problem that existing dictionaries do not perform ambiguity processing based on different scenes; easily confused meanings of the same vocabulary segment are deeply recognized, and accuracy is high.
As an improvement of the above scheme, the device further comprises:
A long word simplification module, configured to simplify the long words in the word segmentation result with a trained abbreviation model to obtain the corresponding abbreviations, and to perform matching verification between the abbreviations and the standard dictionary.
As an improvement of the above scheme, the standard dictionary is a multi-semantic network lexicon, obtained specifically by the following step:
Collecting general dictionaries, entity dictionaries and domain corpora, and merging them to generate the multi-semantic network lexicon.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a segmenting method for Chinese text in Embodiment 1 of the invention.
Fig. 2 is a schematic flowchart of a segmenting method for Chinese text in Embodiment 2 of the invention.
Fig. 3 is a schematic structural diagram of a segmenting device for Chinese text in Embodiment 3 of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative work fall within the scope of protection of the invention.
Referring to Fig. 1, a schematic flowchart of the segmenting method for Chinese text provided by Embodiment 1 of the invention, the method comprises the steps of:
S1: receiving text to be segmented, and initially segmenting the text after matching it against a standard dictionary;
Here the standard dictionary is a multi-semantic network lexicon. It differs from a traditional two-dimensional dictionary in that it supports entry segmentation based on natural language processing and provides richer extended attributes.
S2: after the initial segmentation of the text to be segmented, identifying the scene of the text by means of a CRF model;
Here CRF (Conditional Random Field) is one of the algorithms commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition, part-of-speech tagging and the like.
S3: performing ambiguity identification on the entries of the text to be segmented according to the identified scene, thereby obtaining the word segmentation result of the text.
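Step S3 can be illustrated with a minimal sketch in which the scene found in step S2 selects among competing segmentations. The source does not describe how the scene enters the decision; here each scene is assumed to carry its own word weights, and the candidate with the highest total weight wins. `SCENE_WEIGHTS`, the scene names, the weight values and `pick_segmentation` are all hypothetical.

```python
# Hypothetical per-scene word weights; the patent does not specify their form or source.
SCENE_WEIGHTS = {
    "geography": {"南京市": 2.0, "长江大桥": 3.0, "市长": 0.5},
    "politics": {"南京": 1.0, "市长": 3.0, "江大桥": 2.5},
}


def pick_segmentation(candidates, scene, default=0.1):
    """Return the candidate segmentation whose words score highest under `scene`."""
    weights = SCENE_WEIGHTS.get(scene, {})

    def score(words):
        # Words unknown to the scene receive a small default weight.
        return sum(weights.get(word, default) for word in words)

    return max(candidates, key=score)


CANDIDATES = [["南京市", "长江大桥"], ["南京", "市长", "江大桥"]]
print(pick_segmentation(CANDIDATES, "geography"))
print(pick_segmentation(CANDIDATES, "politics"))
```

The same character string therefore yields a different segmentation depending on the scene, which is the behaviour the method attributes to scene-based ambiguity identification.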
In a specific implementation, text to be segmented is received and initially segmented after being matched against the standard dictionary. After the initial segmentation, the scene of the text is identified by means of a CRF model, and ambiguity identification is then performed on the entries of the text according to that scene to obtain the word segmentation result. This effectively addresses the problem that existing dictionaries do not perform ambiguity processing based on different scenes; easily confused meanings of the same vocabulary segment are deeply recognized, and accuracy is high.
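For reference, the CRF mentioned in step S2 is usually applied in its linear-chain form, whose standard definition is given below. The source does not state the label set or the feature functions used for scene recognition, so the symbols here are the conventional textbook notation rather than anything taken from the patent: $x$ is the input sequence of length $T$, $y$ a label sequence, $f_k$ the feature functions and $\lambda_k$ their learned weights.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

The normalizer $Z(x)$ sums over all label sequences $y'$ for the same input, so the model scores a whole labelling at once instead of each position independently.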
Preferably, as shown in Fig. 2, on the basis of Embodiment 1 the method further comprises the step of:
S4: simplifying the long words in the word segmentation result with a trained abbreviation model to obtain the corresponding abbreviations, and performing matching verification between the abbreviations and the standard dictionary.
With this scheme, abbreviations are generated and verified automatically, which saves much of the work of compiling abbreviations by hand.
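Step S4 can be sketched as a generate-then-verify loop. The trained abbreviation model is not described in the source, so a simple rule (join the first character of each component word) stands in for it here. "Matching verification" is read as a lookup in the standard dictionary, which is also an assumption. The names `abbreviate`, `verify_abbreviation` and the toy `LEXICON` are hypothetical.

```python
def abbreviate(components):
    """Stand-in for the trained abbreviation model: join the first character of each component."""
    return "".join(word[0] for word in components if word)


def verify_abbreviation(abbr, lexicon):
    """Assumed form of matching verification: accept only abbreviations the lexicon lists."""
    return abbr in lexicon


# Toy lexicon, assumed for illustration only.
LEXICON = {"北京", "大学", "北京大学", "北大"}

abbr = abbreviate(["北京", "大学"])
print(abbr, verify_abbreviation(abbr, LEXICON))
```

Separating generation from verification keeps the dictionary as the final authority, so an implausible output from the model can be rejected before it is used.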
Preferably, the multi-semantic network lexicon is obtained by the following step:
Collecting general dictionaries, entity dictionaries and domain corpora, and merging them to generate the multi-semantic network lexicon.
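The merging step above can be sketched as combining three sources into one lexicon whose entries carry several attributes instead of a single weight. This assumes the third source is a domain-specific corpus that has already been tokenised. The attribute names `pos`, `entity_type` and `domain_count`, and the function `merge_lexicon`, are hypothetical, since the source does not list the attributes the lexicon stores.

```python
from collections import defaultdict


def merge_lexicon(general, entities, domain_corpus):
    """Merge three sources into one lexicon whose entries carry several attributes."""
    lexicon = defaultdict(lambda: {"pos": None, "entity_type": None, "domain_count": 0})
    for word, pos in general.items():          # general dictionary: word -> part of speech
        lexicon[word]["pos"] = pos
    for word, entity_type in entities.items():  # entity dictionary: word -> entity type
        lexicon[word]["entity_type"] = entity_type
    for sentence in domain_corpus:              # domain corpus: lists of tokens
        for word in sentence:
            lexicon[word]["domain_count"] += 1
    return dict(lexicon)


general = {"大桥": "n", "长江": "ns"}
entities = {"长江大桥": "LOC", "长江": "LOC"}
corpus = [["长江大桥", "通车"], ["长江", "大桥"]]

lexicon = merge_lexicon(general, entities, corpus)
print(lexicon["长江"])
```

An entry such as 长江 ends up with a part of speech, an entity type and a domain frequency at once, which is one way to read the "richer extended attributes" the description attributes to the lexicon.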
Referring to Fig. 3, a schematic structural diagram of the segmenting device for Chinese text provided by Embodiment 3 of the invention, the device comprises:
An initial word segmentation module 101, configured to receive text to be segmented and to initially segment the text after matching it against a standard dictionary;
A scene recognition module 102, configured to identify the scene of the text to be segmented by means of a CRF model after the initial segmentation;
An ambiguity identification module 103, configured to perform ambiguity identification on the entries of the text to be segmented according to the identified scene, thereby obtaining the word segmentation result of the text.
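The module structure above maps naturally onto a small pipeline object. The sketch below shows only the wiring between modules 101, 102 and 103. The class name `SegmentationDevice`, its field names and the trivial stand-in callables are assumptions, and none of the stand-ins represents the patent's actual models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SegmentationDevice:
    """Wires the three modules together; each module is injected as a plain callable."""
    initial_segmenter: Callable[[str], List[str]]              # module 101
    scene_recognizer: Callable[[List[str]], str]               # module 102
    ambiguity_resolver: Callable[[List[str], str], List[str]]  # module 103

    def segment(self, text: str) -> List[str]:
        tokens = self.initial_segmenter(text)          # initial segmentation
        scene = self.scene_recognizer(tokens)          # scene of the text
        return self.ambiguity_resolver(tokens, scene)  # scene-based disambiguation


# Trivial stand-ins so the sketch runs end to end.
device = SegmentationDevice(
    initial_segmenter=lambda text: list(text),
    scene_recognizer=lambda tokens: "general",
    ambiguity_resolver=lambda tokens, scene: tokens,
)
print(device.segment("中文"))
```

Injecting the modules as callables keeps each stage replaceable, for example swapping the dictionary matcher without touching scene recognition.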
In a specific implementation, the initial word segmentation module 101 receives text to be segmented and initially segments it after matching against the standard dictionary. After the initial segmentation, the scene recognition module 102 identifies the scene of the text by means of a CRF model. The ambiguity identification module 103 then performs ambiguity identification on the entries of the text according to that scene to obtain the word segmentation result. This effectively addresses the problem that existing dictionaries do not perform ambiguity processing based on different scenes; easily confused meanings of the same vocabulary segment are deeply recognized, and accuracy is high.
Preferably, the segmenting device 100 for Chinese text further comprises:
A long word simplification module, configured to simplify the long words in the word segmentation result with a trained abbreviation model to obtain the corresponding abbreviations, and to perform matching verification between the abbreviations and the standard dictionary.
Here the standard dictionary is a multi-semantic network lexicon, obtained specifically by the following step:
Collecting general dictionaries, entity dictionaries and domain corpora, and merging them to generate the multi-semantic network lexicon.
In summary, the embodiments of the invention disclose a segmenting method and device for Chinese text. Text to be segmented is received and initially segmented after being matched against a standard dictionary. After the initial segmentation, the scene of the text is identified by means of a CRF model, and ambiguity identification is then performed on the entries of the text according to that scene to obtain the word segmentation result. This effectively addresses the problem that existing dictionaries do not perform ambiguity processing based on different scenes; easily confused meanings of the same vocabulary segment are deeply recognized, and accuracy is high.
The above are preferred embodiments of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications are also regarded as falling within the scope of protection of the invention.
Claims (6)
1. A segmenting method for Chinese text, characterized by comprising the steps of:
Receiving text to be segmented, and initially segmenting the text after matching it against a standard dictionary;
After the initial segmentation of the text to be segmented, identifying the scene of the text by means of a CRF model;
Performing ambiguity identification on the entries of the text to be segmented according to the identified scene, thereby obtaining the word segmentation result of the text.
2. The segmenting method for Chinese text according to claim 1, characterized in that the method further comprises the step of:
Simplifying the long words in the word segmentation result with a trained abbreviation model to obtain the corresponding abbreviations, and performing matching verification between the abbreviations and the standard dictionary.
3. The segmenting method for Chinese text according to claim 1, characterized in that the standard dictionary is a multi-semantic network lexicon, obtained by the following step:
Collecting general dictionaries, entity dictionaries and domain corpora, and merging them to generate the multi-semantic network lexicon.
4. A segmenting device for Chinese text, characterized by comprising:
An initial word segmentation module, configured to receive text to be segmented and to initially segment the text after matching it against a standard dictionary;
A scene recognition module, configured to identify the scene of the text to be segmented by means of a CRF model after the initial segmentation;
An ambiguity identification module, configured to perform ambiguity identification on the entries of the text to be segmented according to the identified scene, thereby obtaining the word segmentation result of the text.
5. The segmenting device for Chinese text according to claim 4, characterized in that the device further comprises:
A long word simplification module, configured to simplify the long words in the word segmentation result with a trained abbreviation model to obtain the corresponding abbreviations, and to perform matching verification between the abbreviations and the standard dictionary.
6. The segmenting device for Chinese text according to claim 4, characterized in that the standard dictionary is a multi-semantic network lexicon, obtained specifically by the following step:
Collecting general dictionaries, entity dictionaries and domain corpora, and merging them to generate the multi-semantic network lexicon.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710581114.3A CN107480128A (en) | 2017-07-17 | 2017-07-17 | The segmenting method and device of Chinese text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107480128A (en) | 2017-12-15 |
Family
ID=60596786
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710581114.3A Pending CN107480128A (en) | 2017-07-17 | 2017-07-17 | The segmenting method and device of Chinese text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107480128A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080147381A1 (en) * | 2006-12-13 | 2008-06-19 | Microsoft Corporation | Compound word splitting for directory assistance services |
CN103020034A (en) * | 2011-09-26 | 2013-04-03 | 北京大学 | Chinese words segmentation method and device |
CN102402502A (en) * | 2011-11-24 | 2012-04-04 | 北京趣拿信息技术有限公司 | Word segmentation processing method and device for search engine |
CN103164471A (en) * | 2011-12-15 | 2013-06-19 | 盛乐信息技术(上海)有限公司 | Recommendation method and system of video text labels |
CN106202039A (en) * | 2016-06-30 | 2016-12-07 | 昆明理工大学 | Vietnamese portmanteau word disambiguation method based on condition random field |
CN106815293A (en) * | 2016-12-08 | 2017-06-09 | 中国电子科技集团公司第三十二研究所 | System and method for constructing knowledge graph for information analysis |
Non-Patent Citations (6)
Title |
---|
丁德鑫等: "基于CRF模型的组合型歧义消解研究", 《南京师范大学学报(工程技术版)》 * |
曲维光等: "基于语境信息的组合型分词歧义消解方法", 《计算机工程》 * |
车玲等: "面向词义消歧的条件随机场模型库构建", 《计算机工程》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108170678A (en) * | 2017-12-27 | 2018-06-15 | 广州市云润大数据服务有限公司 | A kind of text entities abstracting method and system |
CN111401388A (en) * | 2018-12-13 | 2020-07-10 | 北京嘀嘀无限科技发展有限公司 | Data mining method, device, server and readable storage medium |
CN111401388B (en) * | 2018-12-13 | 2023-06-30 | 北京嘀嘀无限科技发展有限公司 | Data mining method, device, server and readable storage medium |
CN111950283A (en) * | 2020-07-31 | 2020-11-17 | 合肥工业大学 | Chinese word segmentation and named entity recognition system for large-scale medical text mining |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20171215 |