CN107480128A - The segmenting method and device of Chinese text - Google Patents

Info

Publication number
CN107480128A
Authority
CN
China
Prior art keywords
text
segmented
dictionary
word
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710581114.3A
Other languages
Chinese (zh)
Inventor
晋彤
李永康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Special Road Mdt Infotech Ltd
Original Assignee
Guangzhou Special Road Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Special Road Mdt Infotech Ltd
Priority to CN201710581114.3A
Publication of CN107480128A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)

Abstract

Embodiments of the invention disclose a word segmentation method and device for Chinese text. A text to be segmented is received and, after matching against a standard dictionary, is given an initial segmentation. After the initial segmentation, the scene of the text to be segmented is identified by a CRF model. Ambiguity recognition is then performed on the entries of the text according to the identified scene, yielding the word segmentation result. This effectively addresses the problem that existing dictionaries do not handle ambiguity according to different scenes: easily confused meanings of the same word are recognized in depth, and accuracy is high.

Description

The segmenting method and device of Chinese text
Technical field
The present invention relates to the computer field, and in particular to a word segmentation method and device for Chinese text.
Background art
Chinese word segmentation means cutting a sequence of Chinese characters into individual words. It plays an important role in fields such as information retrieval, machine translation, and speech recognition, and is an essential step in Chinese language processing. Generally, because of segmentation ambiguity, the accuracy of traditional dictionary-based mechanical segmentation methods falls short of 100%. For example, the Chinese string for "Nanjing Yangtze River Bridge" can be segmented in more than one way, depending on where the cuts are placed. Without relying on other knowledge, both segmentations appear reasonable.
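The ambiguity described above can be illustrated with a small sketch. It assumes the example refers to the well-known string 南京市长江大桥, and it uses a hypothetical five-word toy lexicon that is not taken from the source. The function `all_segmentations` is an illustrative helper that lists every split in which each piece is a lexicon word.

```python
def all_segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this head and recurse on the remainder of the string.
            for tail in all_segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical toy lexicon, chosen only to expose the ambiguity.
TOY_LEXICON = {"南京", "南京市", "市长", "长江大桥", "江大桥"}

candidates = all_segmentations("南京市长江大桥", TOY_LEXICON)
for words in candidates:
    print(" / ".join(words))
```

With this lexicon the string has two dictionary-consistent segmentations, and a purely dictionary-based matcher has no principled way to choose between them.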
The dictionaries on which existing word segmentation techniques rely are two-dimensional: at most they carry part of speech and a simple probability weight. At the level of algorithms and models, ambiguity is not handled according to different scenes. In addition, abbreviation entries in current domestic dictionaries are added manually, which is a rather mechanical process.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a word segmentation method and device for Chinese text that effectively address the problem that existing dictionaries do not handle ambiguity according to different scenes, with high accuracy.
To achieve the above purpose, an embodiment of the present invention provides a word segmentation method for Chinese text, comprising the steps of:
receiving a text to be segmented, and performing an initial segmentation of the text after matching it against a standard dictionary;
after the initial segmentation of the text to be segmented, identifying the scene of the text by means of a CRF model;
performing, according to the scene of the text to be segmented, ambiguity recognition on the entries of the text, thereby obtaining the word segmentation result of the text.
Compared with the prior art, the disclosed word segmentation method for Chinese text receives a text to be segmented, performs an initial segmentation after matching against a standard dictionary, identifies the scene of the text by means of a CRF model, and then performs ambiguity recognition on the entries of the text according to that scene to obtain the word segmentation result. This effectively addresses the problem that existing dictionaries do not handle ambiguity according to different scenes: easily confused meanings of the same word are recognized in depth, and accuracy is high.
As an improvement of the above scheme, the method further comprises the step of:
simplifying the long words in the word segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and matching and verifying the abbreviations against the standard dictionary.
As an improvement of the above scheme, the standard dictionary is a multi-semantic network lexicon, which is obtained by the following step:
collecting a general dictionary, an entity dictionary, and domain corpora, and merging them to generate the multi-semantic network lexicon.
An embodiment of the present invention further provides a word segmentation device for Chinese text, comprising:
an initial segmentation module, configured to receive a text to be segmented and to perform an initial segmentation of the text after matching it against a standard dictionary;
a scene recognition module, configured to identify, after the initial segmentation of the text to be segmented, the scene of the text by means of a CRF model;
an ambiguity recognition module, configured to perform, according to the scene of the text to be segmented, ambiguity recognition on the entries of the text, thereby obtaining the word segmentation result of the text.
Compared with the prior art, the disclosed word segmentation device for Chinese text receives a text to be segmented through the initial segmentation module and performs an initial segmentation after matching against a standard dictionary; the scene recognition module then identifies the scene of the text by means of a CRF model; and the ambiguity recognition module performs ambiguity recognition on the entries of the text according to that scene to obtain the word segmentation result. This effectively addresses the problem that existing dictionaries do not handle ambiguity according to different scenes: easily confused meanings of the same word are recognized in depth, and accuracy is high.
As an improvement of the above scheme, the device further comprises:
a long-word simplification module, configured to simplify the long words in the word segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and to match and verify the abbreviations against the standard dictionary.
As an improvement of the above scheme, the standard dictionary is a multi-semantic network lexicon, which is obtained by the following step:
collecting a general dictionary, an entity dictionary, and domain corpora, and merging them to generate the multi-semantic network lexicon.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a word segmentation method for Chinese text in Embodiment 1 of the present invention.
Fig. 2 is a schematic flowchart of a word segmentation method for Chinese text in Embodiment 2 of the present invention.
Fig. 3 is a schematic structural diagram of a word segmentation device for Chinese text in Embodiment 3 of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, a schematic flowchart of a word segmentation method for Chinese text provided by Embodiment 1 of the present invention, the method comprises the steps of:
S1, receiving a text to be segmented, and performing an initial segmentation of the text after matching it against a standard dictionary;
Here, the standard dictionary is a multi-semantic network lexicon. Unlike a traditional two-dimensional dictionary, it supports entry segmentation based on natural language processing and provides richer extended attributes.
S2, after the initial segmentation of the text to be segmented, identifying the scene of the text by means of a CRF model;
Here, CRF (Conditional Random Field) is one of the algorithms commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition, part-of-speech tagging, and the like.
S3, performing, according to the scene of the text to be segmented, ambiguity recognition on the entries of the text, thereby obtaining the word segmentation result of the text.
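For reference, the conditional random field used in step S2 is commonly written in its linear-chain form as follows. This is the standard textbook definition, not a formula from the source; the source does not state which feature functions f_k, weights λ_k, or label set the scene model uses.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Here x is the input token sequence of length T, y is a label sequence, and Z(x) normalizes over all candidate label sequences y'.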
In a specific implementation, a text to be segmented is received and, after matching against the standard dictionary, is given an initial segmentation. After the initial segmentation, the scene of the text is identified by means of a CRF model. Ambiguity recognition is then performed on the entries of the text according to that scene, yielding the word segmentation result. This effectively addresses the problem that existing dictionaries do not handle ambiguity according to different scenes: easily confused meanings of the same word are recognized in depth, and accuracy is high.
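The flow of steps S1 to S3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: S1 is shown as forward maximum matching, S2 uses a keyword-count stand-in instead of a trained CRF model, and S3 assumes a per-scene lookup table of readings, since the source does not specify how ambiguity recognition is carried out. All names, lexicon entries, and scene labels are hypothetical.

```python
def initial_segment(text, lexicon, max_len=4):
    """S1: forward maximum matching; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def identify_scene(tokens, scene_keywords):
    """S2 stand-in (NOT a CRF): pick the scene whose keywords occur most often."""
    scores = {scene: sum(tok in words for tok in tokens)
              for scene, words in scene_keywords.items()}
    return max(scores, key=scores.get)


def resolve_ambiguity(tokens, scene, readings):
    """S3 (assumed form): attach the scene-specific reading to each token, if any."""
    return [(tok, readings.get(tok, {}).get(scene)) for tok in tokens]


# Hypothetical toy data for illustration only.
LEXICON = {"苹果", "发布", "手机"}
SCENE_KEYWORDS = {"technology": {"发布", "手机"}, "food": {"水果", "好吃"}}
READINGS = {"苹果": {"technology": "Apple (company)", "food": "apple (fruit)"}}

tokens = initial_segment("苹果发布新手机", LEXICON)
scene = identify_scene(tokens, SCENE_KEYWORDS)
result = resolve_ambiguity(tokens, scene, READINGS)
print(tokens, scene, result)
```

In a real system the scene classifier would be the trained CRF model named in the text, and the readings would come from the extended attributes of the standard dictionary.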
Preferably, as shown in Fig. 2, on the basis of Embodiment 1, the method further comprises the step of:
S4, simplifying the long words in the word segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and matching and verifying the abbreviations against the standard dictionary.
With this scheme, abbreviations can be generated and verified automatically, which saves much of the work of compiling abbreviations by hand.
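Step S4 can be sketched roughly as follows. The source does not describe the abbreviation model or what the matching verification checks, so this sketch substitutes a naive first-character heuristic for the trained model and assumes that verification means checking whether the candidate abbreviation already exists in the dictionary. The function names and sample words are hypothetical.

```python
def abbreviate(sub_words):
    """Stand-in for the trained abbreviation model: first character of each sub-word."""
    return "".join(w[0] for w in sub_words if w)


def verify_abbreviations(long_words, lexicon):
    """Map each long word to (candidate abbreviation, whether it is in the lexicon)."""
    report = {}
    for long_word, sub_words in long_words.items():
        short = abbreviate(sub_words)
        report[long_word] = (short, short in lexicon)
    return report


# Hypothetical input: long words with their sub-word decomposition.
long_words = {"北京大学": ["北京", "大学"], "清华大学": ["清华", "大学"]}
lexicon = {"北大", "北京", "大学", "清华"}

report = verify_abbreviations(long_words, lexicon)
print(report)
```

The second entry shows why a verification step is useful: the naive heuristic proposes a candidate that the dictionary does not confirm, so it would be flagged for review rather than accepted.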
Preferably, the multi-semantic network lexicon is obtained by the following step:
collecting a general dictionary, an entity dictionary, and domain corpora, and merging them to generate the multi-semantic network lexicon.
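The merge described above might look like the following sketch. The record layout (part of speech, entity type, set of scenes) is an assumption made for illustration; the source says only that the merged lexicon provides richer extended attributes than a two-dimensional dictionary. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def merge_lexicons(general, entities, domain_corpus):
    """Merge three sources into one lexicon keyed by word.

    general:       {word: part_of_speech}
    entities:      {word: entity_type}
    domain_corpus: {scene_name: [words observed in that scene]}
    """
    merged = defaultdict(lambda: {"pos": None, "entity_type": None, "scenes": set()})
    for word, pos in general.items():
        merged[word]["pos"] = pos
    for word, etype in entities.items():
        merged[word]["entity_type"] = etype
    for scene, words in domain_corpus.items():
        for word in words:
            merged[word]["scenes"].add(scene)
    return dict(merged)


# Hypothetical sample data.
general = {"市长": "n", "大桥": "n"}
entities = {"南京市": "LOC", "长江大桥": "LOC"}
domain_corpus = {"geography": ["南京市", "长江大桥"], "politics": ["市长"]}

lexicon = merge_lexicons(general, entities, domain_corpus)
print(lexicon["市长"])
```

Keeping the scenes as a set on each entry is one simple way to let a later disambiguation step ask which scenes a word has been observed in.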
Referring to Fig. 3, a schematic structural diagram of a word segmentation device for Chinese text provided by Embodiment 3 of the present invention, the device comprises:
an initial segmentation module 101, configured to receive a text to be segmented and to perform an initial segmentation of the text after matching it against a standard dictionary;
a scene recognition module 102, configured to identify, after the initial segmentation of the text to be segmented, the scene of the text by means of a CRF model;
an ambiguity recognition module 103, configured to perform, according to the scene of the text to be segmented, ambiguity recognition on the entries of the text, thereby obtaining the word segmentation result of the text.
In a specific implementation, the initial segmentation module 101 receives a text to be segmented and performs an initial segmentation after matching against the standard dictionary; the scene recognition module 102 then identifies the scene of the text by means of a CRF model; and the ambiguity recognition module 103 performs ambiguity recognition on the entries of the text according to that scene to obtain the word segmentation result. This effectively addresses the problem that existing dictionaries do not handle ambiguity according to different scenes: easily confused meanings of the same word are recognized in depth, and accuracy is high.
Preferably, the word segmentation device 100 for Chinese text further comprises:
a long-word simplification module, configured to simplify the long words in the word segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and to match and verify the abbreviations against the standard dictionary.
Here, the standard dictionary is a multi-semantic network lexicon, which is obtained by the following step:
collecting a general dictionary, an entity dictionary, and domain corpora, and merging them to generate the multi-semantic network lexicon.
In summary, the embodiments of the present invention disclose a word segmentation method and device for Chinese text. A text to be segmented is received and, after matching against a standard dictionary, is given an initial segmentation; the scene of the text is then identified by means of a CRF model; and ambiguity recognition is performed on the entries of the text according to that scene to obtain the word segmentation result. This effectively addresses the problem that existing dictionaries do not handle ambiguity according to different scenes: easily confused meanings of the same word are recognized in depth, and accuracy is high.
The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principles of the present invention, and these improvements and refinements are also regarded as falling within the scope of protection of the present invention.

Claims (6)

1. A word segmentation method for Chinese text, characterized by comprising the steps of:
receiving a text to be segmented, and performing an initial segmentation of the text after matching it against a standard dictionary;
after the initial segmentation of the text to be segmented, identifying the scene of the text by means of a CRF model;
performing, according to the scene of the text to be segmented, ambiguity recognition on the entries of the text, thereby obtaining the word segmentation result of the text.
2. The word segmentation method for Chinese text according to claim 1, characterized in that the method further comprises the step of:
simplifying the long words in the word segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and matching and verifying the abbreviations against the standard dictionary.
3. The word segmentation method for Chinese text according to claim 1, characterized in that the standard dictionary is a multi-semantic network lexicon, which is obtained by the following step:
collecting a general dictionary, an entity dictionary, and domain corpora, and merging them to generate the multi-semantic network lexicon.
4. A word segmentation device for Chinese text, characterized by comprising:
an initial segmentation module, configured to receive a text to be segmented and to perform an initial segmentation of the text after matching it against a standard dictionary;
a scene recognition module, configured to identify, after the initial segmentation of the text to be segmented, the scene of the text by means of a CRF model;
an ambiguity recognition module, configured to perform, according to the scene of the text to be segmented, ambiguity recognition on the entries of the text, thereby obtaining the word segmentation result of the text.
5. The word segmentation device for Chinese text according to claim 4, characterized in that the device further comprises:
a long-word simplification module, configured to simplify the long words in the word segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and to match and verify the abbreviations against the standard dictionary.
6. The word segmentation device for Chinese text according to claim 4, characterized in that the standard dictionary is a multi-semantic network lexicon, which is obtained by the following step:
collecting a general dictionary, an entity dictionary, and domain corpora, and merging them to generate the multi-semantic network lexicon.
CN201710581114.3A 2017-07-17 2017-07-17 The segmenting method and device of Chinese text Pending CN107480128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710581114.3A CN107480128A (en) 2017-07-17 2017-07-17 The segmenting method and device of Chinese text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710581114.3A CN107480128A (en) 2017-07-17 2017-07-17 The segmenting method and device of Chinese text

Publications (1)

Publication Number Publication Date
CN107480128A true CN107480128A (en) 2017-12-15

Family

ID=60596786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710581114.3A Pending CN107480128A (en) 2017-07-17 2017-07-17 The segmenting method and device of Chinese text

Country Status (1)

Country Link
CN (1) CN107480128A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170678A (en) * 2017-12-27 2018-06-15 广州市云润大数据服务有限公司 A kind of text entities abstracting method and system
CN111401388A (en) * 2018-12-13 2020-07-10 北京嘀嘀无限科技发展有限公司 Data mining method, device, server and readable storage medium
CN111950283A (en) * 2020-07-31 2020-11-17 合肥工业大学 Chinese word segmentation and named entity recognition system for large-scale medical text mining

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147381A1 (en) * 2006-12-13 2008-06-19 Microsoft Corporation Compound word splitting for directory assistance services
CN102402502A (en) * 2011-11-24 2012-04-04 北京趣拿信息技术有限公司 Word segmentation processing method and device for search engine
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN103164471A (en) * 2011-12-15 2013-06-19 盛乐信息技术(上海)有限公司 Recommendation method and system of video text labels
CN106202039A (en) * 2016-06-30 2016-12-07 昆明理工大学 Vietnamese portmanteau word disambiguation method based on condition random field
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147381A1 (en) * 2006-12-13 2008-06-19 Microsoft Corporation Compound word splitting for directory assistance services
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN102402502A (en) * 2011-11-24 2012-04-04 北京趣拿信息技术有限公司 Word segmentation processing method and device for search engine
CN103164471A (en) * 2011-12-15 2013-06-19 盛乐信息技术(上海)有限公司 Recommendation method and system of video text labels
CN106202039A (en) * 2016-06-30 2016-12-07 昆明理工大学 Vietnamese portmanteau word disambiguation method based on condition random field
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
丁德鑫等: "基于CRF模型的组合型歧义消解研究 ", 《南京师范大学学报(工程技术版)》 *
丁德鑫等: "基于CRF模型的组合型歧义消解研究", 《南京师范大学学报(工程技术版)》 *
曲维光等: "基于语境信息的组合型分词歧义消解方法 ", 《计算机工程》 *
曲维光等: "基于语境信息的组合型分词歧义消解方法", 《计算机工程》 *
车玲等: "面向词义消歧的条件随机场模型库构建 ", 《计算机工程》 *
车玲等: "面向词义消歧的条件随机场模型库构建", 《计算机工程》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170678A (en) * 2017-12-27 2018-06-15 广州市云润大数据服务有限公司 A kind of text entities abstracting method and system
CN111401388A (en) * 2018-12-13 2020-07-10 北京嘀嘀无限科技发展有限公司 Data mining method, device, server and readable storage medium
CN111401388B (en) * 2018-12-13 2023-06-30 北京嘀嘀无限科技发展有限公司 Data mining method, device, server and readable storage medium
CN111950283A (en) * 2020-07-31 2020-11-17 合肥工业大学 Chinese word segmentation and named entity recognition system for large-scale medical text mining

Similar Documents

Publication Publication Date Title
CN109726293B (en) Causal event map construction method, system, device and storage medium
CN104142915B (en) A kind of method and system adding punctuate
CN107402916A (en) The segmenting method and device of Chinese text
CN109635297B (en) Entity disambiguation method and device, computer device and computer storage medium
JP7096919B2 (en) Entity word recognition method and device
CN112948543A (en) Multi-language multi-document abstract extraction method based on weighted TextRank
CN103077164A (en) Text analysis method and text analyzer
CN106446018B (en) Query information processing method and device based on artificial intelligence
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN104809117A (en) Video data aggregation processing method, aggregation system and video searching platform
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN106933800A (en) A kind of event sentence abstracting method of financial field
CN110674378A (en) Chinese semantic recognition method based on cosine similarity and minimum editing distance
CN107480128A (en) The segmenting method and device of Chinese text
CN103559181A (en) Establishment method and system for bilingual semantic relation classification model
CN103955450A (en) Automatic extraction method of new words
CN106528694A (en) Artificial intelligence-based semantic judgment processing method and apparatus
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN105244024B (en) A kind of audio recognition method and device
CN113806483A (en) Data processing method and device, electronic equipment and computer program product
WO2024138859A1 (en) Cross-language entity word retrieval method, apparatus and device, and storage medium
CN111538805A (en) Text information extraction method and system based on deep learning and rule engine
CN106776590A (en) A kind of method and system for obtaining entry translation
CN111310452A (en) Word segmentation method and device
CN111897958B (en) Ancient poetry classification method based on natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171215
