CN107357895A - Processing method for text representation based on a bag-of-words model - Google Patents
Processing method for text representation based on a bag-of-words model
- Publication number
- CN107357895A CN107357895A CN201710569638.0A CN201710569638A CN107357895A CN 107357895 A CN107357895 A CN 107357895A CN 201710569638 A CN201710569638 A CN 201710569638A CN 107357895 A CN107357895 A CN 107357895A
- Authority
- CN
- China
- Prior art keywords
- words
- feature
- text
- weight
- bag
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Abstract
The invention belongs to the field of computer applications and discloses a processing method for text representation based on the bag-of-words model. The method applies word segmentation, stop-word removal, low-frequency-word removal, feature selection and similar processing steps to a collected text data set; the processed text is then represented with a vector space model, while word vectors are trained on the processed text with a neural network method; finally, the weights of the feature words of the bag-of-words model are modified according to word-vector similarity, yielding a new text representation model. The method addresses the text representation problem and improves classification accuracy.
Description
Technical field
The invention belongs to the field of computer applications, and more particularly relates to a processing method for text representation based on the bag-of-words model.
Background technology
At present, text processing is widely used in many fields. In general, a text must be segmented into words; stop words and low-frequency words are then removed, features are selected, the text is represented in a computable form, and classification is finally performed. Research on text processing differs across countries, as do the results achieved. Relative to other countries, research and exploration of text processing in China started later and still lags somewhat behind.
Word segmentation: English words are separated by spaces, which act as natural delimiters, so English text needs no further segmentation. When a computer processes Chinese text, however, the text must first be segmented: automatic word segmentation requires the computer to split a sentence into meaningful words according to its semantics. Natural language processing takes the word as its smallest unit, so segmentation accuracy directly affects the quality of text classification.
Feature selection: if a text is represented by all of the feature words it contains, the dimension of the feature space usually exceeds 100,000. Such a high-dimensional space makes computation very inefficient or even infeasible. In fact, some words in a text contribute very little: a common function word occurs in nearly every text and cannot serve as a feature of any particular text, so it is useless for subsequent classification. Words that can represent the text must therefore be selected to form a new feature space, achieving the goal of dimensionality reduction.
Text representation: text that humans understand is character-encoded, while computer architectures operate on binary. The task of text representation is to convert text encoding into computer encoding so that the computer can perform calculations on the text information. The choice of text representation directly affects classification performance. The most commonly used text representation model is the vector space model. However, the weights of many feature words in the vector space model are zero, which degrades classification; the present invention therefore proposes to modify the feature weights in the vector space model to improve classification accuracy.
Word vectors are vector representations of individual words obtained by training a neural network natural language processing model on text. A popular approach is the neural network language model called Word2Vec developed by Google, which captures context information while compressing the data scale. Word2Vec actually comprises two different methods: Continuous Bag of Words (CBOW) and Skip-gram. The goal of CBOW is to predict the probability of the current word from its context; Skip-gram does the opposite, predicting the probability of the context from the current word (as shown in Fig. 2). Both methods use an artificial neural network as their classification algorithm. Initially, each word is a random N-dimensional vector; after training, the CBOW or Skip-gram algorithm yields an optimal vector for each word. These word vectors capture context information and can be used to make predictions on unseen data.
Summary of the invention
In order to solve the text representation problem in prior-art text processing and improve the accuracy of text classification, the present invention provides a processing method for text representation based on the bag-of-words model. The invention combines the vector space model with word vectors to establish a text model, so that whole text documents can be classified with improved accuracy. The technical scheme of the invention is:
Step 1: preprocessing;
The text data set is segmented, stop words and low-frequency words are removed, and feature words are then selected;
Step 2: the preprocessed text data set is represented with a bag-of-words model; the bag-of-words model is a text representation model with TFIDF (term frequency-inverse document frequency) as the weight;
Step 3: word vectors are obtained by training a neural network natural language processing model on the preprocessed text data set;
Step 4: the weights of the feature words of the bag-of-words model obtained in step 2 are modified according to the similarity of the word vectors obtained in step 3, yielding a new text representation model. In the TFIDF weight matrix of the vector space model, each feature corresponds to one dimension of the feature space, each text is represented as a row of the matrix, and each column represents a feature word. Many feature words in this matrix have a TFIDF weight of zero, and these zero weights degrade classification. A zero entry is modified according to the TFIDF values of the n words most similar to it, where similarity is computed from the word vectors trained by the neural network. The concrete modification is as follows: for the TFIDF-weighted text representation model obtained in step 2, let t be a feature word in some row of the corresponding feature weight matrix whose feature weight Wt is zero.
In the first case, the feature weight Wt is approximated from the weights Wt1, Wt2, Wt3, ..., Wtn of the words t1, t2, t3, ..., tn closest to t; the number n of similar words is controlled by the similarity threshold m on feature-word similarity:
Wt = (Wt1·S(t, t1) + Wt2·S(t, t2) + ... + Wtn·S(t, tn)) / n (1)
where S(t, ti) is the similarity between feature word t and feature word ti.
In the second case, the feature weight Wt is approximated from the weight Wi of the most similar word among t1, t2, t3, ..., tn:
Wt = Wi·S(t, i) (2)
where S(t, i) is the similarity between feature word t and feature word i.
Further, for a small data set, the preprocessed text data set is replicated n times, n being a positive integer, to enlarge the data set; the word vectors are then obtained by training the neural network natural language processing model, and the resulting word vectors are of better quality.
The beneficial effect of the present invention is that the vector space model is combined with word vectors to establish a text model, so that whole text documents can be classified with improved accuracy.
Brief description of the drawings
Fig. 1 is a schematic diagram of the text representation process based on the bag-of-words model and word vectors.
Fig. 2 shows the CBOW and Skip-gram models used to train the word vectors.
Fig. 3 compares classification performance using the RandomForest classifier.
Embodiment
The specific embodiments described here merely illustrate the implementation of the present invention and do not limit its scope. Embodiments of the present invention are described in detail below with reference to the accompanying drawings, and specifically comprise the following steps:
1. Formatting of the data set. Data sets come in different formats: some are stored in txt files, others in pkl files. The present embodiment provides a text processing system that uniformly converts the data set into CSV files. CSV is a common and relatively simple plain-text file format. It uses a character set such as ASCII, Unicode, GB2312 or UTF-8; it consists of records (typically one record per line); each record is divided into fields by separators (typical separators are the comma, semicolon, tab or space); and every record has the same field sequence.
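The CSV conversion described above can be sketched with Python's standard csv module; the field names and sample records here are illustrative stand-ins, not taken from the patent's actual data set.

```python
import csv
import io

def to_csv(records, fieldnames=("label", "text")):
    """Serialize (label, text) records to CSV text, one record per line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(fieldnames)          # header row: same field sequence for every record
    for label, text in records:
        writer.writerow([label, text])   # csv quotes any field that contains the separator
    return buf.getvalue()

csv_text = to_csv([("pos", "great movie, loved it"), ("neg", "boring plot")])
```

Note how the second field of the first record is quoted automatically because it contains the comma separator; this is why CSV remains readable as plain text while still supporting arbitrary field contents.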
2. Preprocessing of the data. In general, the text must be segmented, and stop words and low-frequency words must be removed.
(1) Word segmentation. English words are separated by spaces, which act as natural delimiters, so English text needs no segmentation; removing punctuation and digits suffices. Each Chinese word, however, consists of a varying number of characters, so Chinese text must first be segmented. Automatic word segmentation requires the computer to split a sentence into meaningful words according to its semantics. Natural language processing takes the word as its smallest unit, and segmentation accuracy directly affects classification quality, so the text must first be segmented; the present embodiment uses the jieba segmentation package for Chinese word segmentation.
(2) Stop-word removal. Function words such as "我" ("I") occur in every text and have no effect on distinguishing document categories, so they are removed. For English, NLTK provides a standard stopwords corpus, so stop words are easily removed with good results. For Chinese, however, there is no standard stop-word dictionary, so a stop-word list must be downloaded and used to remove stop words.
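Stop-word and low-frequency-word removal can be sketched as follows; the stop list is a tiny illustrative sample, not the standard NLTK corpus or a downloaded Chinese stop-word table, and the documents are toy data.

```python
from collections import Counter

STOP_WORDS = {"的", "了", "我", "the", "a", "of"}  # illustrative sample only

def remove_stop_and_rare(tokenized_docs, min_freq=2):
    """Drop stop words and words occurring fewer than min_freq times overall."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return [[t for t in doc if t not in STOP_WORDS and counts[t] >= min_freq]
            for doc in tokenized_docs]

docs = [["我", "喜欢", "电影"], ["我", "讨厌", "电影"], ["电影", "喜欢"]]
cleaned = remove_stop_and_rare(docs)
```

Here "我" is removed as a stop word and "讨厌" as a low-frequency word (it occurs only once in the corpus), matching steps (2) and (3) of the preprocessing.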
(3) Low-frequency words usually have little influence on a document and in some cases should be removed; in other cases, however, it is exactly these specific words that distinguish a document from others.
(4) Because English has tense and voice, words must in some cases be stemmed, reducing each word to its base form.
3. Feature selection. The dimension of the feature space is usually more than 100,000, and such a high-dimensional space makes computation very inefficient or even infeasible. Some words in the text contribute very little: they occur in almost every text, cannot serve as features of any particular text, and are therefore useless for subsequent classification. Words that can represent the text must be selected to form a new feature space, achieving the goal of dimensionality reduction. Common feature selection methods include document frequency (DF), mutual information (MI), information gain (IG) and the χ² statistic (CHI); of these, information gain is the most widely used in text classification, and the present invention uses information gain for feature selection.
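The information gain of a candidate feature word t is IG(t) = H(C) - P(t)·H(C|t) - P(t̄)·H(C|t̄), where C ranges over the document classes. A minimal sketch over tokenized, labeled documents (toy data; the function names are illustrative, not from the patent):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t)H(C|t) - P(not t)H(C|not t) over tokenized docs."""
    n = len(docs)
    classes = set(labels)

    def class_entropy(indices):
        indices = list(indices)
        if not indices:
            return 0.0
        return entropy([sum(1 for i in indices if labels[i] == c) / len(indices)
                        for c in classes])

    with_t = [i for i, d in enumerate(docs) if term in d]
    without_t = [i for i in range(n) if i not in with_t]
    return (class_entropy(range(n))
            - len(with_t) / n * class_entropy(with_t)
            - len(without_t) / n * class_entropy(without_t))

docs = [["好"], ["好"], ["差"], ["差"]]
labels = ["pos", "pos", "neg", "neg"]
ig = information_gain(docs, labels, "好")
```

A word that perfectly separates the classes (here "好") gains one full bit, while a word absent from every document gains nothing; selecting the top-k words by IG yields the reduced feature space.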
4. Text representation. Text representation formalizes a text as numbers the computer can compute with, so that the computer can "understand" natural language text. The text representation model in general use today is the vector space model (VSM), which is also the most effective in text classification; the choice of text representation directly affects classification performance. The basic idea of the VSM is to represent a large collection of texts as a feature-word matrix, so that comparing text similarity reduces to comparing the similarity of feature vectors in space, which is clearer and easier to understand. In this feature-word matrix, each feature corresponds to one dimension of the feature space; the number of rows equals the number of texts to be classified, each text is represented as a row of the matrix, and each column represents a feature word. In practice, the vector space model usually uses TFIDF as the weight value. The TFIDF weight is calculated as follows:
TFIDF(t, d) = TF(t, d) × log(N / DF(t))
where TF(t, d) is the frequency of term t in document d, N is the total number of documents, and DF(t) is the number of documents containing t.
5. The neural network language model (Google's open-source Word2vec framework) is trained on the data set preprocessed in step 1. The data set used in this embodiment is relatively small, so it is enlarged by replicating it n times. Training yields a dictionary in which each word is a vector; these word vectors capture context information. The present invention combines the vector space model with word vectors, and this document representation method improves classification performance.
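The similarity lookup that the trained word vectors enable can be sketched with cosine similarity; the three-dimensional vectors below are hand-made stand-ins for Word2Vec output (real training with the Word2vec framework yields vectors of tens to hundreds of dimensions).

```python
import math

VECTORS = {  # illustrative toy vectors standing in for trained Word2Vec output
    "电影": [0.9, 0.1, 0.2],
    "影片": [0.8, 0.2, 0.3],
    "苹果": [0.1, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, k=1):
    """Return the k words closest to `word` by cosine similarity."""
    scored = [(cosine(VECTORS[word], VECTORS[w]), w) for w in VECTORS if w != word]
    return sorted(scored, reverse=True)[:k]

best = most_similar("电影")[0][1]
```

Because "电影" and "影片" (both meaning "film") were assigned nearby toy vectors, the lookup returns "影片"; after real training, semantically related words end up close in the same way.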
6. For the TFIDF weight matrix of the vector space model obtained in step 4: in this feature-word matrix, each feature corresponds to one dimension of the feature space, the number of rows equals the number of texts to be classified, each text is represented as a row, and each column represents a feature word. Many feature words in this matrix have a TFIDF weight of zero, and these zero weights degrade classification. Using the word vectors obtained in step 5, the present invention proposes that for each feature word whose TFIDF weight is zero, its similar words are looked up via the word vectors, and the weights of similar words whose TFIDF values are non-zero are used to approximate the feature word whose TFIDF value is zero. The specific implementation is as follows: for the obtained vector space model and its corresponding TFIDF weight matrix, let t be a feature word in some row whose feature weight Wt is zero. Then either:
(1) the feature weight Wt is approximated from the weights Wt1, Wt2, Wt3, ..., Wtn of the words t1, t2, t3, ..., tn closest to t, where the number n of similar words is controlled by the similarity threshold m on feature-word similarity, as shown in formula (1); or
(2) Wt is approximated from the weight Wi of the most similar word among t1, t2, t3, ..., tn, as shown in formula (2).
7. The text model established by the present invention is classified with a RandomForest classifier. As its name suggests, a random forest is a forest built in a random manner: it consists of many decision trees, and the decision trees of a random forest are independent of one another. Once the forest has been built, each new input sample is judged separately by every decision tree in the forest, each tree voting for the class the sample should belong to (for a classification task); the class receiving the most votes is the prediction for that sample. The SST (Stanford Sentiment Treebank) data set is used for classification; comparing the classification accuracy of the plain bag-of-words model against the model modified by the present invention shows that the processing method for bag-of-words-based text representation proposed here achieves higher accuracy.
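The majority-vote prediction of the random forest can be sketched as follows; the stub "trees" are hand-written rules standing in for learned decision trees (a practical implementation would train them on bootstrap samples, e.g. with scikit-learn's RandomForestClassifier).

```python
from collections import Counter

# stub decision trees over word-count features; real trees would be learned
trees = [
    lambda x: "pos" if x["好"] > 0 else "neg",
    lambda x: "pos" if x["好"] >= x["差"] else "neg",
    lambda x: "neg" if x["差"] > 1 else "pos",
]

def forest_predict(sample):
    """Every tree votes independently; the majority class wins."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

pred = forest_predict({"好": 2, "差": 0})
```

Each tree judges the sample on its own, and the class with the most votes is returned, which is exactly the voting scheme described above.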
Claims (3)
1. A processing method for text representation based on a bag-of-words model, characterized by comprising the following steps:
Step 1: preprocessing;
The text data set is segmented, stop words and low-frequency words are removed, and feature words are then selected;
Step 2: the preprocessed text data set is represented with a bag-of-words model; the bag-of-words model is a text representation model with TFIDF as the weight;
Step 3: word vectors are obtained by training a neural network natural language processing model on the preprocessed text data set;
Step 4: the weights of the feature words of the bag-of-words model obtained in step 2 are modified according to the similarity of the word vectors obtained in step 3, yielding a new text representation model; the concrete modification is: for the TFIDF-weighted text representation model obtained in step 2, given a feature word t in some row of the corresponding feature weight matrix whose feature weight Wt is zero, the feature weight Wt is approximated from the weights Wt1, Wt2, Wt3, ..., Wtn of the words t1, t2, t3, ..., tn closest to t, the number n of similar words being controlled by the similarity threshold m on feature-word similarity.
2. The processing method for text representation based on a bag-of-words model according to claim 1, characterized in that in step 2, the preprocessed text data set is replicated n times, n being a positive integer, to enlarge the data set, and the word vectors are then obtained by training the neural network natural language processing model.
3. The processing method for text representation based on a bag-of-words model according to claim 1 or 2, characterized in that in step 4, the weights of the feature words of the bag-of-words model obtained in step 2 are modified according to the similarity of the word vectors obtained in step 3, yielding a new text representation model; the concrete modification is: for the TFIDF-weighted text representation model obtained in step 2, given a feature word t in some row of the corresponding feature weight matrix whose feature weight Wt is zero, the feature weight Wt is approximated from the weight Wi of the most similar word among the words t1, t2, t3, ..., tn closest to t.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710005310 | 2017-01-05 | ||
CN2017100053106 | 2017-01-05 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107357895A true CN107357895A (en) | 2017-11-17 |
CN107357895B CN107357895B (en) | 2020-05-19 |
Family
ID=60292842
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710569638.0A Expired - Fee Related CN107357895B (en) | 2017-01-05 | 2017-07-14 | Text representation processing method based on bag-of-words model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107357895B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109284382A (en) * | 2018-09-30 | 2019-01-29 | 武汉斗鱼网络科技有限公司 | A kind of file classification method and computing device |
CN109543036A (en) * | 2018-11-20 | 2019-03-29 | 四川长虹电器股份有限公司 | Text Clustering Method based on semantic similarity |
CN110362815A (en) * | 2018-04-11 | 2019-10-22 | 北京京东尚科信息技术有限公司 | Text vector generation method and device |
WO2020199595A1 (en) * | 2019-04-04 | 2020-10-08 | 平安科技(深圳)有限公司 | Long text classification method and device employing bag-of-words model, computer apparatus, and storage medium |
CN111859901A (en) * | 2020-07-15 | 2020-10-30 | 大连理工大学 | English repeated text detection method, system, terminal and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102930063A (en) * | 2012-12-05 | 2013-02-13 | 电子科技大学 | Feature item selection and weight calculation based text classification method |
CN103927302A (en) * | 2013-01-10 | 2014-07-16 | 阿里巴巴集团控股有限公司 | Text classification method and system |
US20150026104A1 (en) * | 2013-07-17 | 2015-01-22 | Christopher Tambos | System and method for email classification |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN104809131A (en) * | 2014-01-27 | 2015-07-29 | 董靖 | Automatic classification system and method of electronic documents |
CN104881400A (en) * | 2015-05-19 | 2015-09-02 | 上海交通大学 | Semantic dependency calculating method based on associative network |
- 2017-07-14: application CN201710569638.0A, granted as CN107357895B (status: not active, Expired - Fee Related)
Non-Patent Citations (1)
Title |
---|
ZHU Xuemei: "Microblog Recommendation Based on Word2Vec Topic Extraction" (基于Word2Vec主题提取的微博推荐), China Master's Theses Full-text Database, Information Science & Technology, 2016, No. 03 *
Also Published As
Publication number | Publication date |
---|---|
CN107357895B (en) | 2020-05-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | |
Granted publication date: 20200519 Termination date: 20210714 |