CN110162765A - Machine-assisted reading and auditing method and system based on a summary mode - Google Patents
- Publication number
- CN110162765A (application CN201810142416.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- abstract
- content
- module
- parsing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
Abstract
The invention discloses a machine-assisted reading and auditing method and system based on a summary mode. The method proceeds as follows: input a text and parse its data content and format; classify the parsed text content, aggregate content of the same category and attach a category label, forming functional blocks with category labels; extract the corresponding summary content from each functional block; output the summary content and combine it with the reviewer's opinion to form the review result. A machine model extracts the summary of the original text in advance, together with the source-text information that supports the summary, which helps users complete the audit quickly by reading the summary. Even if an automatic summary is unclear or wrongly extracted, it can be corrected quickly against the corresponding source text, which greatly reduces the cost of manual auditing and improves audit efficiency.
Description
Technical Field
The present invention relates to the field of document processing, and in particular to a machine-assisted reading and auditing method and system based on a summary mode.
Background Art
Many industries need to read and audit large volumes of documents. Traditional document auditing is mainly manual: the documents to be audited are exported from a business information system and then examined subjectively by domain experts. With massive data the reading load is huge, and the reader must understand the document content, make judgments, and reach decisions. Most document content is unstructured or semi-structured, and authors differ in skill and in way of thinking, so reviewers have to read and check everything even though little of the content actually deserves close attention. Time and labor are wasted, and efficiency is low.
With the rapid development of information technology in recent years, information is acquired and supplied ever more quickly, which further increases the complexity and difficulty of professional auditing. Traditional text auditing methods alone fall far short of what the development of society requires and cannot meet the actual needs of enterprises. The audit industry currently has no mature solution for document review.
In view of these problems, there is a need for a machine-assisted reading and auditing method or system that accurately identifies the important content of a document, provides auditors with concise, accurate, and important document content, and improves their working efficiency.
Summary of the Invention
To overcome the above problems, the inventors conducted intensive research and provide a machine-assisted reading and auditing method and system based on a summary mode. The input document is divided into blocks, classified, and given category labels that can be adjusted; summaries are extracted to obtain the information of interest; and the summaries are finally edited and modified to yield the data the user wants. This realizes summary output for audit documents, thereby completing the present invention.
The object of the present invention is to provide the following technical solutions:
(1) A machine-assisted reading and auditing method based on a summary mode, the method comprising the following steps:
Step 100: input a text and parse its data content and format;
Step 200: classify the parsed text content, aggregate content of the same category and attach a category label, forming functional blocks with category labels;
Step 300: extract the corresponding summary content from each functional block;
Step 400: output the summary content and combine it with the reviewer's opinion to form the review result.
(2) A system for implementing the method of (1) above, the system comprising:
An input parsing module, for inputting a text and parsing its data content and format;
A block classification module, for classifying the parsed text content, aggregating content of the same category and attaching a category label, forming functional blocks with category labels;
A summary extraction module, for extracting the corresponding summary content from each functional block;
A summary output and editing module, for outputting the summary content and combining it with the reviewer's opinion to form the review result.
The machine-assisted reading and auditing method and system based on a summary mode provided by the present invention have the following beneficial effects:
(1) In the present invention, a machine model extracts the summary of the original text in advance, together with the source-text information that supports the summary, helping users complete the audit quickly by reading the summary. Even if an automatic summary is unclear or wrongly extracted, it can be corrected quickly against the corresponding source text, which greatly reduces the cost of manual auditing and improves audit efficiency;
(2) In the present invention, a Word or PDF document is first converted to XML and then to plain text, which ensures that the original data is not lost and guarantees parsing quality;
(3) In the present invention, the whole text is divided into different functional blocks, which lets subsequent operations find the content to be extracted clearly, comprehensively, and quickly, and also makes the type of the extracted data clear;
(4) In the present invention, a corresponding machine model (sentence selection model) is trained for each functional block according to the block's characteristics, so the extracted summary content is more targeted and more accurate. Determining the summary content of each functional block jointly with the block-specific machine model and a general-purpose machine model not only further improves extraction accuracy but, more importantly, addresses both the case where a functional block has no corresponding machine model and the case where a small training sample leaves the machine model insufficiently accurate.
Brief Description of the Drawings
Fig. 1 is a flowchart of a machine-assisted reading and auditing method based on a summary mode according to a preferred embodiment of the present invention;
Fig. 2 is a schematic diagram of the Word document input in the example;
Fig. 3 is a schematic diagram of the input Word document parsed into the XML data format;
Fig. 4 shows the software interface for summary output;
Fig. 5 is a schematic diagram of a Word document input during model training or actual classification;
Fig. 6 is a schematic diagram of the tree-shaped document structure formed after the document structure is parsed;
Fig. 7 is a schematic diagram of the document body text after structure information has been added in front of it;
Fig. 8 shows the result of word segmentation on the text after structure information has been added;
Fig. 9 is a schematic diagram of the result of assigning a feature value to each segmented word with a statistical algorithm;
Fig. 10 is a schematic diagram of the summary software interface after editing mode is started;
Fig. 11 shows, for Embodiment 2, the classification accuracy of the classification model on pre-loan audit documents in credit auditing;
Fig. 12 shows, for Embodiment 2, the NDCG@5 evaluation of the ranking model for each category.
Detailed Description of Embodiments
The present invention is described in detail below with reference to the drawings and embodiments. From this description, the features and advantages of the present invention will become clearer.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred over or superior to other embodiments.
The present invention provides a machine-assisted reading and auditing method based on a summary mode. The method extracts important information from documents in audit work and presents it to the user as a summary, so that the user can carry out the audit quickly and accurately. Text auditing is the rapid reading and auditing of unstructured or semi-structured texts of the same class in a specific industry, leading to a final audit conclusion and opinion. Unstructured text is data text that is not easily represented by the two-dimensional logical tables of a database (structured data). Semi-structured text is structured data whose structure varies widely: because the details of the data must be understood, it cannot simply be organized into a file and handled as unstructured data, and because the structure varies so much, a two-dimensional logical table cannot simply be built to match it.
In the present invention, the unstructured or semi-structured texts to be processed are batches of texts from the same industry that are structurally similar and follow certain guidelines, i.e., texts of the "same class". For example, in the loan audit industry, a "credit investigation report for the financing of a certain project" or an "examination report on a certain company's loan application" generally has a fixed, clear structure within the relevant department, and the main content of interest in the industry is similar, which favors batch processing.
As shown in Fig. 1, the machine-assisted reading and auditing method based on a summary mode provided by the present invention includes the following steps:
Step 100: input a text and parse its data content and format;
Step 200: classify the parsed text content, aggregate content of the same category and attach a category label, forming functional blocks with category labels;
Step 300: extract the corresponding summary content from each functional block;
Step 400: output the summary content and combine it with the reviewer's opinion to form the review result.
Step 100, text input and parsing: input a text and parse its data content and format.
In the present invention, the input text may be in any existing document format. The original text is preferably input as a Word or PDF document, since these two formats are also the main forms in which audit documents are presented.
Word and PDF documents present information visually to people, but a machine cannot directly identify the information they carry, so both formats need to be converted into a machine-processable format, namely plain text such as the txt format.
In a preferred embodiment, if the input text is a Word or PDF document, the document is converted directly to plain text.
In a further preferred embodiment, if the input text is a Word or PDF document, the document content is first parsed into XML (Extensible Markup Language) data, and the plain text is then obtained by parsing the XML data. For example, a Word document is input (see Fig. 2), the LibreOffice program is called to parse the content of the Word document into XML data (see Fig. 3), and the final text document is then obtained by parsing the XML data according to the OASIS standard "Open Document Format for Office Applications (OpenDocument) Version 1.2". The XML approach is preferred because Word extraction tools and other text extraction tools can lose part of the original data or its formatting. Converting to XML data first preserves the integrity of the original data. Because the subsequent XML parsing is not limited by the parsing quality of a third party, the conversion from XML data to plain text can be defined according to project needs, so text and format information is less likely to be lost and parsing quality is guaranteed.
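The XML parsing step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the LibreOffice conversion has already produced OpenDocument XML, and the function name `extract_blocks` and the sample document are hypothetical. Only the `text:h` and `text:p` elements and the `text:outline-level` attribute come from the OpenDocument standard.

```python
import xml.etree.ElementTree as ET

# OpenDocument namespace for text elements (from the ODF standard).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical, hand-written fragment standing in for a converted document.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body><office:text>
    <text:h text:outline-level="1">Enterprise background</text:h>
    <text:p>The company was founded in <text:span>2010</text:span>.</text:p>
  </office:text></office:body>
</office:document-content>"""


def extract_blocks(xml_string):
    """Return (kind, level, text) tuples for headings and paragraphs, in document order."""
    root = ET.fromstring(xml_string)
    blocks = []
    for elem in root.iter():
        if elem.tag == f"{{{TEXT_NS}}}h":
            # Headings keep their outline level so the structure can be rebuilt later.
            level = int(elem.get(f"{{{TEXT_NS}}}outline-level", "1"))
            blocks.append(("heading", level, "".join(elem.itertext()).strip()))
        elif elem.tag == f"{{{TEXT_NS}}}p":
            # itertext() also collects text inside nested spans.
            text = "".join(elem.itertext()).strip()
            if text:
                blocks.append(("paragraph", 0, text))
    return blocks


blocks = extract_blocks(SAMPLE)
```

Keeping the heading level alongside the text is one way to preserve the document structure information that the later classification step relies on.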
In the present invention, the text finally obtained after parsing retains the headings at all levels of the original document, i.e., the document structure information, and its paragraph structure is identical to that of the original document: the text content making up each paragraph is the same as the text content making up the corresponding paragraph of the original document. The parsed text is also organized with the clause as the basic unit; that is, the text is split into clauses at commas, periods, question marks, exclamation marks, and semicolons.
Further, the parsing also includes assigning sequential numbers to the clauses in the parsed text, the numbers forming a sentence index. Through the sentence index, a summary extracted later can be linked to the original clause in the text, so that the source of the summary can be found clearly and efficiently.
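Clause splitting and the sentence index can be sketched as below. The function name `build_sentence_index` is hypothetical, and the delimiter set simply lists the five punctuation marks named above in both full-width and ASCII forms. As a simplification, this naive split would also break numbers such as "5,000,000" or "3.5", so a real parser would need to protect them.

```python
import re

# A clause is a run of non-delimiter characters plus an optional closing delimiter.
_CLAUSE_RE = re.compile(r"[^，。？！；,.?!;]+[，。？！；,.?!;]?")


def build_sentence_index(text):
    """Split text into clauses and number them sequentially, starting at 1."""
    clauses = (m.group().strip() for m in _CLAUSE_RE.finditer(text))
    return {number: clause
            for number, clause in enumerate((c for c in clauses if c), start=1)}


index = build_sentence_index(
    "The company was founded in 2010, and it makes steel; is it profitable? Yes!"
)

# A summary only needs to store clause numbers; the text is looked up on demand.
summary_ids = [1, 3]
summary_text = [index[i] for i in summary_ids]
```

Storing only clause numbers in the summary is what makes it cheap to jump from a summary item back to its position in the source text.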
Fig. 4 shows the software interface for summary output. In the summary display, the summary area is on the left and the parsed text area is on the right. The summary for each category on the left is extracted from the content carrying the same label in the text on the right, so every sentence in the text that was selected for the summary is covered with colored shading and corresponds one-to-one with the summary content on the left. To make it easier to confirm the source of a summary, the auditor can click a summary item on the left, and the right side jumps directly to the position from which that summary was extracted and highlights it with colored shading. This correspondence is implemented through the sentence index: the document on the right is organized with the clause as the basic unit, and the clauses are numbered sequentially, so the left side only needs to record the clause numbers in the right-hand text to obtain the clause content and the correspondence.
Step 200, block classification: classify the parsed text content, aggregate content of the same category and attach a category label, forming functional blocks with category labels.
To distinguish the types of text data and locate the target content (the content of interest) clearly, the content of the parsed text is classified, the classified content is aggregated, and a category label is attached. The different text content is then presented as functional blocks.
Taking a credit audit document as an example, the document can generally be divided into the following ten categories: summary information, enterprise background, business situation, credit position, account analysis, guarantee analysis, mortgage analysis, project situation, risk analysis, and branch opinion, forming ten corresponding functional blocks. After block classification, the type to which each piece of text data belongs is clear, which makes the audit document easier for auditors to handle.
In the present invention, aggregation means gathering adjacent content of the same category, so that the content can be presented while the order of the original text is preserved. Alternatively, it means gathering both adjacent and non-adjacent content of the same category, which may change the order of the original text but makes it easier to handle content of the same category together.
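The two aggregation options can be illustrated with a short sketch. The function names and the toy labels are hypothetical; the input is assumed to be a list of already classified paragraphs as (label, text) pairs.

```python
from itertools import groupby


def aggregate_adjacent(paragraphs):
    """Merge only neighbouring paragraphs with the same label; original order is kept."""
    return [(label, [text for _, text in run])
            for label, run in groupby(paragraphs, key=lambda p: p[0])]


def aggregate_all(paragraphs):
    """Merge every paragraph with the same label; blocks follow first appearance."""
    blocks = {}
    for label, text in paragraphs:
        blocks.setdefault(label, []).append(text)
    return list(blocks.items())


paragraphs = [
    ("enterprise background", "p1"),
    ("enterprise background", "p2"),
    ("risk analysis", "p3"),
    ("enterprise background", "p4"),
]
adjacent_blocks = aggregate_adjacent(paragraphs)
merged_blocks = aggregate_all(paragraphs)
```

With the first option "p4" stays in its own trailing block, while the second option moves it next to "p1" and "p2", which changes the reading order.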
In a preferred embodiment of the present invention, text content is classified with the paragraph as the basic unit.
In a preferred embodiment of the present invention, a classification model built with logistic regression is used to classify the text content. Building the classification model includes a training process and a test process:
Training process: label the corpus with the categories it belongs to, forming training samples; extract the features of the training samples to train the model; when training, call the corresponding interface of the chosen model, for example a classification model from linear_model in the third-party open-source package sklearn;
Test process: use labeled or unlabeled corpus as test samples; extract the features of the test samples, load the model, and obtain the classification results; adjust the model according to the classification results until a model with high classification accuracy is obtained.
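A minimal sketch of such a paragraph classifier is shown below, assuming scikit-learn is installed. The training paragraphs are made up for illustration and are assumed to be already segmented into space-separated words; the pipeline layout and hyperparameters are assumptions, not values given in the patent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up training paragraphs with their category labels.
train_texts = [
    "company founded 2010 shareholders registered capital",
    "company history shareholders founder",
    "loan default risk market risk exposure",
    "risk of guarantee failure and price risk",
]
train_labels = [
    "enterprise background",
    "enterprise background",
    "risk analysis",
    "risk analysis",
]

# analyzer=str.split treats each whitespace-separated token as one word,
# which suits text that has already been segmented.
model = make_pipeline(
    TfidfVectorizer(analyzer=str.split),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

predicted = model.predict(["market risk and default risk"])
```

In the test process described above, predictions on held-out samples would be compared with their labels, and the features or model settings adjusted until accuracy is acceptable.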
In a preferred embodiment of the present invention, during model training or actual classification, the features that indicate category membership are extracted from the document structure information and the body text information. Document structure information refers to the headings at all levels of the document; body text information refers to the body content of the document, excluding the headings.
Because document structure information guides or summarizes the content that follows it, it is important classification information. Including it in feature extraction improves classification accuracy.
Feature extraction from the document structure information and body text information during model training or actual classification includes the following steps:
I) To express the structure information of the document (training sample or test sample) completely, parse the document structure and build the document structure information into a tree-shaped document structure. As shown in Fig. 5, the training or test sample is a Word document; after parsing, the document structure information forms the tree-shaped document structure shown in Fig. 6.
Parsing here means converting the document (training sample or test sample) into XML data and then extracting the body text information and document structure information from the XML data. After parsing, the document structure information is converted into the tree-shaped document structure.
II) Using the tree-shaped document structure, place the headings at all levels in front of the corresponding body text, forming content of the form heading + body text, as shown in Fig. 7, so as to add structure information.
III) Segment the text into words, as shown in Fig. 8, and use a statistical algorithm to assign each segmented word a feature value, forming the features. As shown in Fig. 9, the feature values are computed with TF-IDF. Words of designated classes are generalized, for example numbers are generalized to <num> and personal names to <person>, and punctuation and similar tokens are removed as stop words.
IV) In model training, input the features into a logistic regression (LR) model for training to obtain the classification model; in testing, input the features into the classification model to classify.
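Steps II and III can be sketched as follows. This is an illustrative sketch under stated assumptions: the tree is represented as nested dictionaries with hypothetical keys `title`, `paragraphs`, and `children`, the text is assumed to be already segmented, and only number generalization is shown, since replacing personal names with <person> would need a name recognizer that is not described here.

```python
import re


def flatten(node, path=()):
    """Yield 'headings + body text' strings by walking the heading tree (step II)."""
    path = path + (node["title"],)
    for paragraph in node.get("paragraphs", []):
        yield " ".join(path) + " " + paragraph
    for child in node.get("children", []):
        yield from flatten(child, path)


def generalize(tokens):
    """Replace numeric tokens with <num> and drop punctuation-only tokens (step III)."""
    result = []
    for token in tokens:
        if re.fullmatch(r"\d+(?:[.,]\d+)*", token):
            result.append("<num>")
        elif re.fullmatch(r"\W+", token):
            continue  # punctuation is treated as a stop word
        else:
            result.append(token)
    return result


tree = {
    "title": "Report",
    "children": [
        {"title": "1 Background", "paragraphs": ["Founded in 2010 ."]},
    ],
}
texts = list(flatten(tree))
tokens = generalize("Founded in 2010 .".split())
```

The resulting strings would then be segmented and weighted (for example with TF-IDF) before being passed to the classifier in step IV.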
Step 300, summary extraction: extract the corresponding summary content from each functional block.
In step 300, a corresponding machine model is trained for each functional block according to the block's characteristics and is used to extract the corresponding summary content.
Summaries are extracted with a sentence selection model (Rank). The minimum selection unit is the clause (a short sentence delimited by a comma, period, question mark, exclamation mark, or semicolon). After processing by the sentence selection model, the n highest-ranked clauses are taken as the summary result (n can be adjusted as needed, and the sentence length is adjustable).
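Selecting the n highest-ranked clauses can be sketched as below. The function name is hypothetical, and returning the selected clause numbers in document order is an assumption made for readability; the patent only states that the top n clauses are taken.

```python
def select_summary(scored_clauses, n=3):
    """Pick the n best-scoring clauses and return their numbers in document order.

    scored_clauses is a list of (clause_number, score) pairs.
    """
    top = sorted(scored_clauses, key=lambda item: item[1], reverse=True)[:n]
    return sorted(clause_number for clause_number, _ in top)


selected = select_summary([(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7)], n=2)
```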
In the present invention, the sentence selection model is preferably trained by regression on lexical content with a boosting regression method (such as sklearn's GradientBoostingRegressor), or by using another ranking model. Building the sentence selection model includes a training process and a test process:
Training process: label the sentences in the corpus with their categories (such as "summary", "non-summary", "important summary"), forming training samples; extract the features of the training samples; train the model;
Test process: use labeled or unlabeled corpus as test samples; extract the features of the test samples, load the model, obtain the test results, and rank by them; adjust the model according to the accuracy of the ranking to obtain the final sentence selection model.
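A minimal sketch of a regression-based sentence selection model follows, assuming scikit-learn is installed. The clauses, the numeric encoding of the labels (2.0 for "important summary", 1.0 for "summary", 0.0 for "non-summary"), and the hyperparameters are all assumptions made for illustration.

```python
from collections import Counter

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction import DictVectorizer

# Toy, made-up clauses (already segmented) with assumed relevance targets.
clauses = [
    "registered capital 500 ten-thousand-yuan",
    "the report is organized as follows",
    "total liabilities exceed total assets",
    "see the attached table",
]
relevance = [2.0, 0.0, 1.0, 0.0]

# Word-frequency features per clause.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([Counter(c.split()) for c in clauses])

ranker = GradientBoostingRegressor(n_estimators=50, random_state=0)
ranker.fit(X, relevance)


def rank_clauses(new_clauses):
    """Return clause positions ordered from highest to lowest predicted score."""
    features = vectorizer.transform([Counter(c.split()) for c in new_clauses])
    scores = ranker.predict(features)
    return sorted(range(len(new_clauses)), key=lambda i: scores[i], reverse=True)


ranking = rank_clauses(clauses)
```

Words unseen during training are ignored by the vectorizer at prediction time, which is one reason a small training sample can limit accuracy.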
In a preferred embodiment, during the building or actual use of the sentence selection model, the unit of feature extraction is the clause rather than the paragraph. After word segmentation, feature extraction is performed, i.e., each segmented word is assigned a feature value to form the features. In practice it was found that feature extraction can use either word-frequency statistics or the TF-IDF algorithm, with word-frequency statistics preferred. The features are input into the ranking model to build the sentence selection model, or into the built sentence selection model to extract summaries.
In a further preferred embodiment, sentences in the training and test samples are marked with a start marker and an end marker, and these markers are included in feature extraction. The start marker is a special symbol indicating the start of a sentence; the end marker is a special symbol indicating the end of a sentence. A sentence is a character string that ends with sentence-final punctuation such as a period or question mark, and it may contain several clauses.
The start and end markers also help to obtain document structure information. Document structure information is not a sentence or clause, so no start marker (such as <s>) or end marker (such as </s>) is placed before or after it. By checking whether a character string lacks the start and end markers, document structure information can be identified quickly and then used for feature extraction, yielding summary content that carries document structure information.
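The marker check can be sketched in a few lines. The helper names are hypothetical; the sketch assumes that every sentence has been wrapped in <s> and </s> and that headings have been left unmarked.

```python
START, END = "<s>", "</s>"


def mark_sentence(sentence):
    """Wrap a sentence in the start and end markers."""
    return f"{START}{sentence}{END}"


def is_structure_info(line):
    """Treat a line with neither marker as a heading (document structure information)."""
    return not line.startswith(START) and not line.endswith(END)


lines = [
    "1. Enterprise background",
    mark_sentence("The company was founded in 2010."),
]
flags = [is_structure_info(line) for line in lines]
```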
An example of the feature extraction steps follows:
Input the clause "The company's registered capital is 5,000,000 yuan," and mark the start marker: "<s>The company's registered capital is 5,000,000 yuan,";
Word segmentation gives: "<s> / the / company / registered / capital / 500 / ten-thousand yuan," (in the original Chinese, the amount is written as the number 500 followed by the unit "ten-thousand yuan");
Feature extraction gives: {"<s>": 1, "the": 1, "ten-thousand yuan": 1, "<num>": 1, "registered": 1, "capital": 1, "company": 1}.
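The worked example above can be reproduced with a small word-frequency counter. The function name is hypothetical, the tokens are an English rendering of the segmented clause, and the trailing comma is assumed to have been removed as a stop word before counting.

```python
import re
from collections import Counter


def clause_features(tokens):
    """Count each token once per occurrence, replacing numbers with <num>."""
    counts = Counter()
    for token in tokens:
        key = "<num>" if re.fullmatch(r"\d+", token) else token
        counts[key] += 1
    return dict(counts)


tokens = ["<s>", "the", "company", "registered", "capital", "500", "ten-thousand-yuan"]
features = clause_features(tokens)
```

Because the number 500 is generalized, two clauses that differ only in the amount produce identical features, which is the intended effect of the generalization.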
Because the documents processed by the present invention are of the same class, the functional blocks obtained by block classification are limited in number and fixed, which favors training highly accurate machine models.
In a preferred embodiment of the present invention, in addition to training a corresponding machine model for each functional block, a general-purpose machine model is trained on the entire text. In training the general-purpose machine model, features are extracted from the whole document, so the model is suitable for extracting summary content from every functional block.
Preferably, the summary content of each functional block is determined jointly by the sentence selection model for that block and the general-purpose machine model. For example, the sentence selection model and the general-purpose machine model are each assigned a weight; the result produced by the sentence selection model is multiplied by its weight to give result A, and the result produced by the general-purpose machine model is multiplied by its weight to give result B; result A and result B are then calculated and converted to obtain the final test result for a clause, which is used as the basis for ranking.
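One possible way to combine the two results is a weighted sum, sketched below. The patent does not specify how results A and B are calculated and converted, so the sum, the default weights, and the fallback to the general-purpose score when a block has no model of its own are all assumptions.

```python
def combined_score(block_score, general_score, w_block=0.7, w_general=0.3):
    """Combine the block-specific score (A) and the general-purpose score (B).

    If the functional block has no sentence selection model, block_score is None
    and the general-purpose score is used on its own.
    """
    if block_score is None:
        return general_score
    result_a = w_block * block_score
    result_b = w_general * general_score
    return result_a + result_b


score = combined_score(0.8, 0.4)
fallback = combined_score(None, 0.4)
```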
The reason for training a general-purpose machine model on the entire text is as follows. Although documents of the "same class" are structurally very similar, document structure inevitably varies with the author's own choices. In that case, a functional block may have no trained sentence selection model, or the sample size available for training its sentence selection model may be small, so that the stability and accuracy of the trained model are inadequate. The general-purpose machine model can measure how important a clause is to the whole document. In general, a clause that is highly important to the whole document is very likely to be highly important within its functional block as well. The general-purpose machine model can therefore address both the absence of a corresponding sentence selection model and the insufficient accuracy caused by a small training sample. Even when a mature sentence selection model has been trained, the general-purpose machine model can work with it to adjust the extracted summary content and improve extraction accuracy.
Step 400, forming the conclusion: output the summary content and combine it with the reviewer's opinion to form the review result.
The reviewer can also modify the summary content produced in step 300 manually and add conclusion information, so as to obtain the data the reviewer wants. Modification means adding or deleting clauses.
Fig. 10 shows the software interface for summary output; the summary area is on the left and the parsed text area is on the right. Clicking the "Edit summary" label starts editing mode, in which the summary page on the left can be edited. Clicking the "X" label issues a delete instruction, and the summary item is deleted. Clicking or dragging to select the content to be added in the original text on the right issues an add instruction, and the selected content is added to the summary area on the left.
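The delete and add operations can be sketched on top of the sentence index. The class name `SummaryEditor` and its methods are hypothetical; the sketch assumes the summary is stored as a list of clause numbers and that the user interface translates clicks into these calls.

```python
class SummaryEditor:
    """Keeps the summary as clause numbers and resolves text through the sentence index."""

    def __init__(self, sentence_index, summary_ids):
        self.index = sentence_index
        self.summary_ids = list(summary_ids)

    def delete(self, clause_number):
        """Handle a click on the 'X' label of a summary item."""
        if clause_number in self.summary_ids:
            self.summary_ids.remove(clause_number)

    def add(self, clause_number):
        """Handle a selection made in the source text on the right."""
        if clause_number in self.index and clause_number not in self.summary_ids:
            self.summary_ids.append(clause_number)
            self.summary_ids.sort()  # keep document order

    def render(self):
        """Return the clause texts currently in the summary."""
        return [self.index[number] for number in self.summary_ids]


editor = SummaryEditor({1: "a,", 2: "b,", 3: "c."}, [1, 2])
editor.delete(2)
editor.add(3)
rendered = editor.render()
```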
Another aspect of the present invention provides a machine-assisted reading and auditing system based on a summary mode for implementing the above method. The system includes:
An input parsing module, for inputting a text and parsing its data content and format;
A block classification module, for classifying the parsed text content, aggregating content of the same category and attaching a category label, forming functional blocks with category labels;
A summary extraction module, for extracting the corresponding summary content from each functional block;
A summary output and editing module, for outputting the summary content and combining it with the reviewer's opinion to form the review result.
In the present invention, the input parsing module converts the input text into a plain-text document, such as a txt document. The input parsing module accepts any existing document format, preferably a Word or PDF document.
If the input text is a Word or PDF document, the input parsing module first parses the document content into XML data and then obtains the plain-text document by parsing the XML data.
Further, the input parsing module also assigns sequential numbers to the clauses in the converted text, the numbers forming a sentence index.
In the present invention, the block classification module classifies the text content using the paragraph as the minimum basic unit. Preferably, the block classification module builds a classification model using logistic regression to classify the text content.
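To make paragraph-level classification with logistic regression concrete, here is a small pure-Python stand-in: a multinomial logistic regression over bag-of-words counts, trained by stochastic gradient descent. In practice a library implementation would normally be used. The training paragraphs, the whitespace tokenizer, the hyperparameters and the class name `ParagraphClassifier` are all invented for illustration, and the patent's own feature extraction is not reproduced here.

```python
import math
from collections import Counter


def tokenize(paragraph):
    """Very simple tokenizer: lower-case and split on whitespace."""
    return paragraph.lower().split()


class ParagraphClassifier:
    """Multinomial logistic regression over bag-of-words counts."""

    def fit(self, paragraphs, labels, epochs=300, learning_rate=0.1):
        self.classes = sorted(set(labels))
        self.weights = {c: {} for c in self.classes}  # class -> word -> weight
        self.bias = {c: 0.0 for c in self.classes}
        samples = [(Counter(tokenize(p)), y) for p, y in zip(paragraphs, labels)]
        for _ in range(epochs):
            for features, label in samples:
                probs = self._probabilities(features)
                for c in self.classes:
                    # Gradient of the cross-entropy loss with respect to the class score.
                    error = probs[c] - (1.0 if c == label else 0.0)
                    self.bias[c] -= learning_rate * error
                    for word, count in features.items():
                        old = self.weights[c].get(word, 0.0)
                        self.weights[c][word] = old - learning_rate * error * count
        return self

    def _probabilities(self, features):
        scores = {
            c: self.bias[c]
            + sum(self.weights[c].get(w, 0.0) * n for w, n in features.items())
            for c in self.classes
        }
        top = max(scores.values())  # subtract the maximum for numerical stability
        exps = {c: math.exp(s - top) for c, s in scores.items()}
        total = sum(exps.values())
        return {c: e / total for c, e in exps.items()}

    def predict(self, paragraph):
        probs = self._probabilities(Counter(tokenize(paragraph)))
        return max(probs, key=probs.get)


# Invented training paragraphs, two per category.
train_paragraphs = [
    "the company was founded in 2010 with registered capital of 5,000,000 yuan",
    "the shareholders of the company are Li Si and Zhang San",
    "the loan will be repaid in monthly installments over three years",
    "repayment source is operating income and the repayment schedule is quarterly",
    "the main risk is market price fluctuation and customer concentration",
    "risk of default is low given stable cash flow",
]
train_labels = [
    "business background", "business background",
    "repayment plan", "repayment plan",
    "risk analysis", "risk analysis",
]

model = ParagraphClassifier().fit(train_paragraphs, train_labels)
print(model.predict("shareholders and registered capital of the company"))
```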
In the present invention, the abstract extraction module extracts abstracts from each functional block using the clause as the minimum selection unit.
In a preferred embodiment, the abstract extraction module trains a corresponding machine model for each functional block according to the block's features, and uses it to extract the corresponding abstract content.
In a further preferred embodiment, in addition to the machine model trained for each functional block, a general-purpose machine model is trained on the full text; the sentence selection model for each functional block and the general-purpose machine model then jointly determine the abstract content of each functional block.
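The text does not say how the two models are combined. One possible reading, shown purely as an assumption, is to blend the per-clause scores of the block-specific model and the general-purpose model with a weighted average and rank by the blended score. The function names and the `block_weight` parameter below are hypothetical.

```python
def combine_scores(block_scores, general_scores, block_weight=0.5):
    """Blend per-clause scores from the block-specific and the general-purpose model.

    Assumption: a weighted average; the patent only states that the two models
    "jointly determine" the abstract.
    """
    if len(block_scores) != len(general_scores):
        raise ValueError("both models must score the same clauses")
    return [
        block_weight * b + (1.0 - block_weight) * g
        for b, g in zip(block_scores, general_scores)
    ]


def select_abstract(clauses, block_scores, general_scores, n=1, block_weight=0.5):
    """Return the n clauses with the highest blended score."""
    combined = combine_scores(block_scores, general_scores, block_weight)
    ranked = sorted(zip(clauses, combined), key=lambda pair: pair[1], reverse=True)
    return [clause for clause, _ in ranked[:n]]


clauses = ["clause a", "clause b", "clause c"]
block_scores = [0.2, 0.9, 0.6]      # invented scores from a block-specific model
general_scores = [0.4, 0.5, 0.9]    # invented scores from a general-purpose model
print(select_abstract(clauses, block_scores, general_scores, n=2))
```

Setting `block_weight` to 1.0 reduces this to using the block-specific model alone, which corresponds to the simpler preferred embodiment.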
In the present invention, the abstract output and editing module includes an abstract output submodule, an abstract display submodule, and an abstract editing submodule:
the abstract output submodule receives instructions from the abstract extraction module and, according to the extracted content determined by the abstract extraction module, sends the corresponding clause numbers to the abstract display submodule;
the abstract display submodule receives the clause numbers sent by the abstract output submodule and displays the corresponding clause content; it also receives edit instructions sent by the abstract editing submodule, and deletes the corresponding clause or displays the opinion entered by the reviewer;
the abstract editing submodule receives the instruction to start editing mode and starts it, then receives edit instructions and passes them to the abstract display submodule, which edits the displayed content (deleting a clause or adding the reviewer's opinion).
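The interaction between the three submodules can be sketched as below. This is an illustrative model only: the class names, method names and the `"delete"` / `"add_opinion"` instruction strings are hypothetical, and a real implementation would sit behind the graphical interface rather than plain method calls.

```python
class AbstractDisplay:
    """Display submodule: shows clauses by number and applies edit instructions."""

    def __init__(self, sentence_index):
        self.sentence_index = sentence_index  # clause number -> clause text
        self.shown = []                       # clause numbers currently displayed
        self.opinions = []                    # opinions entered by the reviewer

    def receive_clause_numbers(self, numbers):
        self.shown = [n for n in numbers if n in self.sentence_index]

    def apply_edit(self, action, payload):
        if action == "delete":
            self.shown = [n for n in self.shown if n != payload]
        elif action == "add_opinion":
            self.opinions.append(payload)
        else:
            raise ValueError(f"unknown edit instruction: {action}")

    def render(self):
        return [self.sentence_index[n] for n in self.shown] + self.opinions


class AbstractOutput:
    """Output submodule: forwards the extracted clause numbers to the display."""

    def __init__(self, display):
        self.display = display

    def send(self, extracted_numbers):
        self.display.receive_clause_numbers(extracted_numbers)


class AbstractEditor:
    """Editing submodule: starts editing mode and forwards edit instructions."""

    def __init__(self, display):
        self.display = display
        self.editing = False

    def start_editing(self):
        self.editing = True

    def edit(self, action, payload):
        if not self.editing:
            raise RuntimeError("editing mode has not been started")
        self.display.apply_edit(action, payload)


index = {1: "As of the reporting period,", 2: "the registered capital is 5,000,000 yuan,",
         3: "of which Li Si contributed 4,500,000 yuan,"}
display = AbstractDisplay(index)
AbstractOutput(display).send([2, 3])
before_edit = display.render()

editor = AbstractEditor(display)
editor.start_editing()
editor.edit("delete", 3)
editor.edit("add_opinion", "Registered capital verified.")
after_edit = display.render()
print(before_edit, after_edit, sep="\n")
```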
Embodiment
Embodiment 1
Taking the input Word document "Examination Report on Company A's Loan Application" as an example, machine-assisted reading of the text yields the abstract content the user is concerned with. The content of the report is shown in Figure 2:
Step 1: call the LibreOffice program to parse the content of the Word document into XML data (see Fig. 3), then obtain the final plain-text document by parsing the XML;
Step 2: use the trained LR model to classify the parsed plain-text document into blocks, obtaining the "business background" functional block;
Step 3: use the sentence selection model trained with the Boosting regression method to extract abstracts from the "business background" functional block;
During training of the abstract model, the manually labeled sample set is labeled with only two classes, abstract (1) and non-abstract (0). The Gradient Boosting Regressor method predicts a real value between 0 and 1; the clauses are ranked by this value, and the top-n results are taken as the abstract, as shown in Table 1. If n is 1, the best abstract is obtained: "the company's registered capital is 5,000,000 yuan,";
Table 1
Labeled result | Predicted score | Clause | Explanation |
---|---|---|---|
0 | 0.245879352093 | As of the reporting period, | Non-abstract |
1 | 0.886647164822 | the company's registered capital is 5,000,000 yuan, | Abstract |
0 | 0.677558422089 | of which Li Si contributed 4,500,000 yuan, | Non-abstract |
0 | 0.0709818303585 | accounting for 90%, | Non-abstract |
0 | 0.538590252399 | Zhang San contributed 500,000 yuan, | Non-abstract |
0 | 0.0706759169698 | accounting for 10%. | Non-abstract |
Step 4: output the abstract content "the company's registered capital is 5,000,000 yuan," and, in combination with the reviewer's opinions, form the review result.
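The ranking and top-n selection of Step 3 can be reproduced from the numbers in Table 1. The sketch below covers only the selection step: the predicted scores are copied from Table 1 rather than computed by a trained regressor, the clause texts are English paraphrases, and the function name `top_n_abstract` is hypothetical.

```python
# (label, predicted score, clause) rows copied from Table 1.
TABLE_1 = [
    (0, 0.245879352093, "As of the reporting period,"),
    (1, 0.886647164822, "the company's registered capital is 5,000,000 yuan,"),
    (0, 0.677558422089, "of which Li Si contributed 4,500,000 yuan,"),
    (0, 0.0709818303585, "accounting for 90%,"),
    (0, 0.538590252399, "Zhang San contributed 500,000 yuan,"),
    (0, 0.0706759169698, "accounting for 10%."),
]


def top_n_abstract(rows, n):
    """Rank clauses by predicted score (highest first) and keep the first n."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return [clause for _, _, clause in ranked[:n]]


best = top_n_abstract(TABLE_1, 1)
print(best)
```

With n set to 1 the selection returns the single clause labeled as abstract in Table 1; a larger n would also include lower-scoring clauses such as the one about Li Si's contribution.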
Embodiment 2
In credit review, pre-loan review documents can be divided into the following categories: summary information, business background, operating conditions, project status, account analysis, credit status, borrowing arrangements, repayment plan, guarantee analysis, mortgage analysis, risk analysis, risk prevention, overall evaluation, and branch opinion.
Using the feature extraction method of the present invention, the trained classification model (LR model) divides the documents into functional blocks. As shown in Figure 11, the overall average classification accuracy on the test set reaches 93.1%.
Abstract information is extracted using the method of the present invention. Because a ranking model is used, the abstracts are evaluated with NDCG, a standard ranking metric. The NDCG results (Top 5) for abstract extraction over all pre-loan review documents are given in Table 2 below:
Table 2
Metric | NDCG@1 | NDCG@2 | NDCG@3 | NDCG@4 | NDCG@5 |
---|---|---|---|---|---|
Result | 0.816782 | 0.814651 | 0.821526 | 0.821179 | 0.826785 |
Taking NDCG@5 as an example, the results for each category are shown in Figure 12. Table 2 and Figure 12 show that the abstract extraction accuracy of the present invention is high and meets the processing requirements of the credit review industry.
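For reference, NDCG@k can be computed as in the sketch below. It uses the common definition in which the discounted cumulative gain sums each label divided by log2(rank + 1) and is normalized by the gain of the ideal ordering. The patent does not state which NDCG variant it uses, so this formula is an assumption. With 0/1 labels the linear and exponential gain variants coincide.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked label list."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(labels, scores, k):
    """NDCG@k for one list of clauses, given true labels and predicted scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant clause at all in this list
    return dcg_at_k(ranked_labels, k) / ideal_dcg


# Labels and predicted scores taken from Table 1 of Embodiment 1.
labels = [0, 1, 0, 0, 0, 0]
scores = [0.245879352093, 0.886647164822, 0.677558422089,
          0.0709818303585, 0.538590252399, 0.0706759169698]
print(ndcg_at_k(labels, scores, 1))
```

For the Table 1 example the labeled clause is ranked first, so NDCG@1 is 1.0. The figures in Table 2 are averages over many such lists and cannot be reproduced from this single example.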
The present invention has been described above with reference to preferred embodiments, but these embodiments are merely exemplary and serve only for illustration. On this basis, various substitutions and improvements may be made to the present invention, all of which fall within its scope of protection.
Claims (10)
1. A machine-assisted reading and review method based on an abstract mode, characterized in that the method comprises the following steps:
Step 100: enter text, and parse its data content and format;
Step 200: classify the parsed text content, aggregate content of the same category and label it with a category label, forming functional blocks with category labels;
Step 300: extract the corresponding abstract content from each functional block;
Step 400: output the abstract content and, in combination with the reviewer's opinions, form a review result.
2. The method according to claim 1, characterized in that, in step 100, the parsing includes converting the format of the entered text into plain-text format;
Preferably, the entered text is in Word or PDF document format; the content of the document is parsed into XML data, and the plain-text document is then obtained by parsing the XML data.
3. The method according to claim 1, characterized in that, in step 100, the parsing further includes sequentially numbering the clauses of the parsed text and forming a sentence index from the numbers.
4. The method according to claim 1, characterized in that, in step 200, a classification model is built using logistic regression to classify the text content; preferably, the text content is classified with the paragraph as the basic unit.
5. The method according to claim 1, characterized in that, in step 200, building the classification model includes a training process and a test process:
Training process: label the corpus with the categories it belongs to, forming training samples; extract the features of the training samples to train the model;
Test process: use labeled or unlabeled corpus as test samples; extract the features of the test samples and load the model to obtain classification results; adjust the model according to the classification results until a model with high classification accuracy is obtained;
wherein the feature extraction during model training or actual classification includes: parsing the document structure and organizing the chapter structure information into a tree-shaped document structure; in the tree-shaped document structure, the document titles at each level are placed before the corresponding document text, forming a "title + text" content form, and feature extraction is performed on the text in this content form.
6. The method according to claim 1, characterized in that, in step 300, a corresponding machine model is trained for each functional block according to the block's features and is used to extract the corresponding abstract content.
7. The method according to claim 6, characterized in that, in step 300, in addition to the machine model trained for each functional block, a general-purpose machine model is also trained on the full text;
Preferably, the machine model for each functional block and the general-purpose machine model jointly determine the abstract content of each functional block.
8. The method according to claim 1, characterized in that, in step 300, the minimum selection unit for abstract extraction is the clause, a clause being a short sentence delimited by a comma, full stop, question mark, exclamation mark, or semicolon; after processing by the sentence selection model, the clauses corresponding to the top n ranked items are taken as the abstract result, where n can be adjusted as required.
9. A system for implementing the method of any one of claims 1 to 8, the system comprising:
an entry and parsing module, for entering text and parsing its data content and format;
a block classification module, for classifying the parsed text content, aggregating content of the same category and labeling it with a category label, so as to form functional blocks with category labels;
an abstract extraction module, for extracting the corresponding abstract content from each functional block;
an abstract output and editing module, for outputting the abstract content and combining it with the reviewer's opinions to form a review result.
10. The system according to claim 9, characterized in that the abstract output and editing module includes an abstract output submodule, an abstract display submodule, and an abstract editing submodule:
the abstract output submodule receives instructions from the abstract extraction module and, according to the extracted content determined by the abstract extraction module, sends the corresponding clause numbers to the abstract display submodule;
the abstract display submodule receives the clause numbers sent by the abstract output submodule and displays the corresponding clause content; it also receives edit instructions sent by the abstract editing submodule, and deletes the corresponding clause or displays the opinion entered by the reviewer;
the abstract editing submodule receives the instruction to start editing mode and starts it, then receives edit instructions and passes them to the abstract display submodule, which edits the displayed content.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810142416.5A CN110162765A (en) | 2018-02-11 | 2018-02-11 | A kind of machine aid reading auditing method and system based on abstract mode |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810142416.5A CN110162765A (en) | 2018-02-11 | 2018-02-11 | A kind of machine aid reading auditing method and system based on abstract mode |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110162765A (en) | 2019-08-23 |
Family
ID=67635126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810142416.5A Pending CN110162765A (en) | 2018-02-11 | 2018-02-11 | A kind of machine aid reading auditing method and system based on abstract mode |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110162765A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100312725A1 (en) * | 2009-06-08 | 2010-12-09 | Xerox Corporation | System and method for assisted document review |
CN104657347A (en) * | 2015-02-06 | 2015-05-27 | 北京中搜网络技术股份有限公司 | News optimized reading mobile application-oriented automatic summarization method |
CN107392143A (en) * | 2017-07-20 | 2017-11-24 | 中国科学院软件研究所 | A kind of resume accurate Analysis method based on SVM text classifications |
CN107403375A (en) * | 2017-04-19 | 2017-11-28 | 北京文因互联科技有限公司 | A kind of listed company's bulletin classification and abstraction generating method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190823 |