CN107480128A

CN107480128A - Word segmentation method and device for Chinese text

Info

Publication number: CN107480128A
Application number: CN201710581114.3A
Authority: CN
Inventors: 晋彤; 李永康
Original assignee: Guangzhou Special Road Mdt Infotech Ltd
Current assignee: Guangzhou Special Road Mdt Infotech Ltd
Priority date: 2017-07-17
Filing date: 2017-07-17
Publication date: 2017-12-15

Abstract

An embodiment of the invention discloses a word segmentation method and device for Chinese text. A text to be segmented is received, and initial segmentation is performed on it after matching processing according to a standard dictionary. After the initial segmentation, the scene of the text to be segmented is identified by means of a CRF model. According to that scene, ambiguity identification is performed on the entries of the text to be segmented, thereby obtaining its segmentation result. This effectively solves the problem that existing dictionaries do not perform ambiguity processing on the basis of different scenes: easily confused meanings of the same span of words are identified in depth, and accuracy is high.

Description

Word segmentation method and device for Chinese text

Technical field

The present invention relates to the computer field, and in particular to a word segmentation method and device for Chinese text.

Background art

Chinese word segmentation refers to cutting a sequence of Chinese characters into individual words. It plays an important role in fields such as information retrieval, machine translation and speech recognition, and is an essential step in Chinese language processing. In general, because of segmentation ambiguity, the accuracy of traditional dictionary-based mechanical segmentation methods does not reach 100%. For example, "南京市长江大桥" (Nanjing Yangtze River Bridge) can be segmented as "南京市 / 长江大桥" (Nanjing City / Yangtze River Bridge), but also as "南京 / 市长 / 江大桥" (Nanjing / mayor / Jiang Daqiao). Without relying on other knowledge, both segmentations appear reasonable.
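To illustrate why dictionary-based mechanical segmentation cannot settle such cases by itself, the following is a minimal sketch of forward maximum matching, one common mechanical method. The text does not name a specific matching algorithm, so the choice of forward maximum matching, the function name `forward_max_match`, the `max_len` parameter and the toy dictionary are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation against a set of entries.

    At each position the longest dictionary entry (up to max_len
    characters) is taken; an unknown single character becomes its own token.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary containing entries for both readings.
dictionary = {"南京", "南京市", "市长", "长江大桥", "江大桥"}
result = forward_max_match("南京市长江大桥", dictionary)
print(result)
```

The greedy rule always commits to the longest match at each position, so it returns a single segmentation and never considers the alternative reading, regardless of the context in which the string appears.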

The dictionaries that existing segmentation techniques rely on are two-dimensional, carrying at most a part of speech and a simple probability weight. At the level of the algorithm model, ambiguity is not handled on the basis of different scenes. Abbreviation entries in domestic dictionaries are currently added manually, which is rather mechanical.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a word segmentation method and device for Chinese text that can effectively solve the problem that existing dictionaries do not perform ambiguity processing on the basis of different scenes, with high accuracy.

To achieve the above purpose, an embodiment of the present invention provides a word segmentation method for Chinese text, comprising the steps of:

receiving a text to be segmented, and performing initial segmentation on the text to be segmented after matching processing according to a standard dictionary;

after the initial segmentation of the text to be segmented, identifying the scene of the text to be segmented by means of a CRF model;

according to the scene of the text to be segmented, performing ambiguity identification on the entries of the text to be segmented, thereby obtaining the segmentation result of the text to be segmented.

Compared with the prior art, the word segmentation method for Chinese text disclosed by the invention receives a text to be segmented, performs initial segmentation on it after matching processing according to a standard dictionary, identifies the scene of the text to be segmented by means of a CRF model after the initial segmentation, and then, according to that scene, performs ambiguity identification on the entries of the text to be segmented, thereby obtaining its segmentation result. This effectively solves the problem that existing dictionaries do not perform ambiguity processing on the basis of different scenes: easily confused meanings of the same span of words are identified in depth, and accuracy is high.
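The scene-dependent ambiguity identification described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how competing segmentations are compared, so the per-scene weights, the scene names `geography` and `politics`, and the functions `all_segmentations` and `disambiguate` are hypothetical.

```python
def all_segmentations(text, dictionary):
    """Yield every way of splitting text into dictionary entries."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                yield [head] + rest


def disambiguate(text, dictionary, scene):
    """Return the candidate whose entries carry the most weight in `scene`.

    Returns None when the text cannot be covered by dictionary entries.
    """
    def score(candidate):
        return sum(dictionary[word].get(scene, 0.0) for word in candidate)

    return max(all_segmentations(text, dictionary), key=score, default=None)


# Hypothetical entries: each maps scene name -> weight of the entry in that scene.
dictionary = {
    "南京": {"politics": 1.0},
    "南京市": {"geography": 2.0},
    "市长": {"politics": 3.0},
    "长江大桥": {"geography": 3.0},
    "江大桥": {"politics": 1.0},
}

as_geography = disambiguate("南京市长江大桥", dictionary, "geography")
as_politics = disambiguate("南京市长江大桥", dictionary, "politics")
print(as_geography, as_politics)
```

Enumerating every candidate is exponential in the worst case, so this form is only suitable for short ambiguous spans; it is meant to show that the same string resolves differently once the scene is known.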

As an improvement of the above scheme, the method further comprises the step of:

simplifying the long words in the segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and performing match verification of the abbreviations against the standard dictionary.

As an improvement of the above scheme, the standard dictionary is a network lexicon with multi-dimensional semantics, which is obtained by the following step:

collecting a general dictionary, an entity dictionary and domain corpora, and merging them to generate the network lexicon with multi-dimensional semantics.
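The merging step above can be pictured with a small sketch. The input formats, the attribute names `pos`, `entity` and `scene_freq`, and the function `build_lexicon` are assumptions chosen for illustration; the text only states that a general dictionary, an entity dictionary and domain corpora are collected and merged.

```python
from collections import Counter, defaultdict


def build_lexicon(general, entities, domain_corpora):
    """Merge three sources into one lexicon keyed by entry.

    general:        dict entry -> part of speech (assumed format)
    entities:       dict entry -> entity type (assumed format)
    domain_corpora: dict scene name -> list of pre-segmented sentences
    """
    lexicon = defaultdict(
        lambda: {"pos": None, "entity": None, "scene_freq": Counter()}
    )
    for word, pos in general.items():
        lexicon[word]["pos"] = pos
    for word, entity_type in entities.items():
        lexicon[word]["entity"] = entity_type
    # Count how often each entry occurs in the corpus of each scene.
    for scene, sentences in domain_corpora.items():
        for sentence in sentences:
            for word in sentence:
                lexicon[word]["scene_freq"][scene] += 1
    return dict(lexicon)


# Hypothetical toy inputs.
general = {"长江": "n", "大桥": "n", "市长": "n"}
entities = {"南京市": "LOC", "长江大桥": "LOC"}
domain_corpora = {
    "geography": [["南京市", "长江大桥"]],
    "politics": [["南京", "市长"]],
}
lexicon = build_lexicon(general, entities, domain_corpora)
print(lexicon["南京市"])
```

Each entry ends up with several attributes rather than only a part of speech and a single weight, which is one possible reading of a lexicon that is richer than a two-dimensional one.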

An embodiment of the present invention further provides a word segmentation device for Chinese text, comprising:

an initial segmentation module, configured to receive a text to be segmented and perform initial segmentation on the text to be segmented after matching processing according to a standard dictionary;

a scene recognition module, configured to identify the scene of the text to be segmented by means of a CRF model after the initial segmentation of the text to be segmented;

an ambiguity identification module, configured to perform ambiguity identification on the entries of the text to be segmented according to the scene of the text to be segmented, thereby obtaining the segmentation result of the text to be segmented.

Compared with the prior art, the word segmentation device for Chinese text disclosed by the invention receives a text to be segmented through the initial segmentation module and performs initial segmentation on it after matching processing according to a standard dictionary. After the initial segmentation, the scene recognition module identifies the scene of the text to be segmented by means of a CRF model. The ambiguity identification module then performs ambiguity identification on the entries of the text to be segmented according to that scene, thereby obtaining its segmentation result. This effectively solves the problem that existing dictionaries do not perform ambiguity processing on the basis of different scenes: easily confused meanings of the same span of words are identified in depth, and accuracy is high.

As an improvement of the above scheme, the device further comprises:

a long word simplification module, configured to simplify the long words in the segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and to perform match verification of the abbreviations against the standard dictionary.

As an improvement of the above scheme, the standard dictionary is a network lexicon with multi-dimensional semantics, and the network lexicon with multi-dimensional semantics is obtained specifically by the following steps:

Brief description of the drawings

Fig. 1 is a schematic flowchart of a word segmentation method for Chinese text in Embodiment 1 of the present invention.

Fig. 2 is a schematic flowchart of a word segmentation method for Chinese text in Embodiment 2 of the present invention.

Fig. 3 is a schematic structural diagram of a word segmentation device for Chinese text in Embodiment 3 of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention, without creative effort, fall within the protection scope of the invention.

Referring to Fig. 1, which is a schematic flowchart of a word segmentation method for Chinese text provided by Embodiment 1 of the present invention, the method comprises the steps of:

S1, receiving a text to be segmented, and performing initial segmentation on the text to be segmented after matching processing according to a standard dictionary;

Wherein, the standard dictionary is a network lexicon with multi-dimensional semantics. It differs from a traditional two-dimensional dictionary in that it can support entry segmentation based on natural language processing and can provide richer extended attributes.

S2, after the initial segmentation of the text to be segmented, identifying the scene of the text to be segmented by means of a CRF model;

Wherein, CRF (Conditional Random Field) is one of the algorithms commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition, part-of-speech tagging and the like.

S3, according to the scene of the text to be segmented, performing ambiguity identification on the entries of the text to be segmented, thereby obtaining the segmentation result of the text to be segmented.

In a specific implementation, a text to be segmented is received, and initial segmentation is performed on it after matching processing according to the standard dictionary. After the initial segmentation, the scene of the text to be segmented is identified by means of the CRF model. According to that scene, ambiguity identification is then performed on the entries of the text to be segmented, thereby obtaining its segmentation result. This effectively solves the problem that existing dictionaries do not perform ambiguity processing on the basis of different scenes: easily confused meanings of the same span of words are identified in depth, and accuracy is high.
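How a CRF could yield a scene for step S2 is not detailed in the text. The sketch below shows one possible reading under explicit assumptions: each initially segmented token receives a scene label through Viterbi decoding, which is the standard decoding step of a linear-chain CRF, and the most frequent label is taken as the scene of the text. The scores are hand-set stand-ins for trained parameters, and the names `viterbi`, `identify_scene`, `emission` and `transition` are hypothetical.

```python
from collections import Counter


def viterbi(tokens, labels, emission, transition):
    """Highest-scoring label sequence under a linear-chain scoring model.

    emission[(token, label)] and transition[(prev_label, label)] hold
    additive (log-space) scores; missing keys count as 0.0.
    """
    if not tokens:
        return []
    # best[t][y] = (score of the best path ending in label y at t, that path)
    best = [{y: (emission.get((tokens[0], y), 0.0), [y]) for y in labels}]
    for token in tokens[1:]:
        column = {}
        for y in labels:
            score, path = max(
                (best[-1][p][0] + transition.get((p, y), 0.0), best[-1][p][1])
                for p in labels
            )
            column[y] = (score + emission.get((token, y), 0.0), path + [y])
        best.append(column)
    return max(best[-1].values())[1]


def identify_scene(tokens, labels, emission, transition):
    """Assumed reduction: the scene is the most frequent decoded label."""
    path = viterbi(tokens, labels, emission, transition)
    return Counter(path).most_common(1)[0][0] if path else None


# Hypothetical hand-set scores; a trained CRF would learn these from data.
labels = ["geography", "politics"]
emission = {
    ("南京市", "geography"): 1.0,
    ("南京市", "politics"): 0.5,
    ("长江大桥", "geography"): 2.0,
}
transition = {("geography", "geography"): 0.5, ("politics", "politics"): 0.5}

tokens = ["南京市", "长江大桥"]
scene = identify_scene(tokens, labels, emission, transition)
print(scene)
```

In practice the emission and transition scores would come from a CRF trained on scene-annotated text, and the feature set would be richer than the token identity used here.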

Preferably, as shown in Fig. 2, on the basis of Embodiment 1, the method further comprises the step of:

S4, simplifying the long words in the segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and performing match verification of the abbreviations against the standard dictionary.

With the above scheme, abbreviations can be generated and match-verified automatically, which saves much of the work of manually compiling abbreviations.
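The generate-then-verify flow of step S4 can be sketched as below. The trained abbreviation model is not described in the text, so `abbreviate` is a deliberately naive stand-in (first character of each component), and the names `verify_abbreviations` and `min_len`, the component lists and the toy dictionary are all assumptions.

```python
def abbreviate(components):
    """Stand-in for the trained abbreviation model (assumption):
    take the first character of each component of the long word."""
    return "".join(part[0] for part in components if part)


def verify_abbreviations(long_words, dictionary, min_len=4):
    """Map each long word to (abbreviation, whether it matches the dictionary)."""
    result = {}
    for word, components in long_words.items():
        if len(word) >= min_len:
            abbreviation = abbreviate(components)
            result[word] = (abbreviation, abbreviation in dictionary)
    return result


# Hypothetical inputs: long words with their components, and a toy dictionary.
long_words = {"北京大学": ["北京", "大学"], "清华大学": ["清华", "大学"]}
dictionary = {"北大", "北京大学", "清华大学"}
checked = verify_abbreviations(long_words, dictionary)
print(checked)
```

Here the candidate for the first word matches a dictionary entry while the candidate for the second does not. What is done with an unmatched candidate is not stated in the text; this sketch simply reports the outcome of the check.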

Preferably, the network lexicon with multi-dimensional semantics is obtained by the following steps:

Referring to Fig. 3, which is a schematic structural diagram of a word segmentation device for Chinese text provided by Embodiment 3 of the present invention, the device comprises:

an initial segmentation module 101, configured to receive a text to be segmented and perform initial segmentation on the text to be segmented after matching processing according to a standard dictionary;

a scene recognition module 102, configured to identify the scene of the text to be segmented by means of a CRF model after the initial segmentation of the text to be segmented;

an ambiguity identification module 103, configured to perform ambiguity identification on the entries of the text to be segmented according to the scene of the text to be segmented, thereby obtaining the segmentation result of the text to be segmented.

In a specific implementation, the initial segmentation module 101 receives a text to be segmented and performs initial segmentation on it after matching processing according to the standard dictionary. After the initial segmentation, the scene recognition module 102 identifies the scene of the text to be segmented by means of the CRF model. The ambiguity identification module 103 then performs ambiguity identification on the entries of the text to be segmented according to that scene, thereby obtaining its segmentation result. This effectively solves the problem that existing dictionaries do not perform ambiguity processing on the basis of different scenes: easily confused meanings of the same span of words are identified in depth, and accuracy is high.
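The cooperation of modules 101 to 103 can be sketched as a simple composition. The class name `SegmentationDevice`, the callable signatures and the trivial stand-in modules in the usage example are assumptions; only the order of the three stages is taken from the description.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SegmentationDevice:
    """Wires the three modules together in the order described above."""

    initial_segmenter: Callable[[str], List[str]]                   # module 101
    scene_recognizer: Callable[[List[str]], str]                    # module 102
    ambiguity_resolver: Callable[[str, List[str], str], List[str]]  # module 103

    def segment(self, text: str) -> List[str]:
        tokens = self.initial_segmenter(text)          # initial segmentation
        scene = self.scene_recognizer(tokens)          # scene identification
        return self.ambiguity_resolver(text, tokens, scene)  # final result


# Trivial stand-in modules, only to show the data flow.
device = SegmentationDevice(
    initial_segmenter=lambda text: list(text),
    scene_recognizer=lambda tokens: "general",
    ambiguity_resolver=lambda text, tokens, scene: tokens,
)
output = device.segment("中文")
print(output)
```

Passing the modules in as callables keeps each stage replaceable, so a different matching strategy or scene model could be substituted without changing the device itself.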

Preferably, the word segmentation device 100 for Chinese text further comprises:

Wherein, the standard dictionary is a network lexicon with multi-dimensional semantics, and the network lexicon with multi-dimensional semantics is obtained specifically by the following steps:

In summary, the embodiments of the invention disclose a word segmentation method and device for Chinese text. A text to be segmented is received, and initial segmentation is performed on it after matching processing according to a standard dictionary. After the initial segmentation, the scene of the text to be segmented is identified by means of a CRF model. According to that scene, ambiguity identification is performed on the entries of the text to be segmented, thereby obtaining its segmentation result. This effectively solves the problem that existing dictionaries do not perform ambiguity processing on the basis of different scenes: easily confused meanings of the same span of words are identified in depth, and accuracy is high.

The above is the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications are also regarded as falling within the protection scope of the invention.

Claims

1. A word segmentation method for Chinese text, characterized by comprising the steps of:

receiving a text to be segmented, and performing initial segmentation on the text to be segmented after matching processing according to a standard dictionary;

according to the scene of the text to be segmented, performing ambiguity identification on the entries of the text to be segmented, thereby obtaining the segmentation result of the text to be segmented.

2. The word segmentation method for Chinese text as claimed in claim 1, characterized in that the method further comprises the step of:

simplifying the long words in the segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and performing match verification of the abbreviations against the standard dictionary.

3. The word segmentation method for Chinese text as claimed in claim 1, characterized in that the standard dictionary is a network lexicon with multi-dimensional semantics, and the network lexicon with multi-dimensional semantics is obtained by the following steps:

4. A word segmentation device for Chinese text, characterized by comprising:

an initial segmentation module, configured to receive a text to be segmented and perform initial segmentation on the text to be segmented after matching processing according to a standard dictionary;

a scene recognition module, configured to identify the scene of the text to be segmented by means of a CRF model after the initial segmentation of the text to be segmented;

an ambiguity identification module, configured to perform ambiguity identification on the entries of the text to be segmented according to the scene of the text to be segmented, thereby obtaining the segmentation result of the text to be segmented.

5. The word segmentation device for Chinese text as claimed in claim 4, characterized in that the device further comprises:

a long word simplification module, configured to simplify the long words in the segmentation result by means of a trained abbreviation model to obtain the corresponding abbreviations, and to perform match verification of the abbreviations against the standard dictionary.

6. The word segmentation device for Chinese text as claimed in claim 4, characterized in that the standard dictionary is a network lexicon with multi-dimensional semantics, and the network lexicon with multi-dimensional semantics is obtained specifically by the following steps: