CN112732897A - Document processing method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN112732897A (application number CN202011583169.6)
- Authority
- CN
- China
- Prior art keywords
- document
- target
- target document
- preset
- investment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The application provides a document processing method, a document processing device, an electronic device and a computer readable storage medium. The method comprises the following steps: receiving a target document uploaded by a user, generating a document abstract of the target document through a preset automatic abstract tool, and extracting target keywords from the document abstract; judging whether the target document meets a preset requirement according to the number of the extracted target keywords, wherein the preset requirement is that the number of the target keywords contained in the document is larger than a preset threshold value; and if the target document meets the preset requirement, classifying and labeling the target document based on a pre-trained BERT model, and outputting the category and the label corresponding to the target document. Compared with the prior art, the relationship between a buyer and the issuing enterprise can be automatically classified and identified from the investment relation certification document uploaded by the buyer, without manual identification, which greatly reduces the consumption of manpower and time.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a document processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Going public in the narrow sense, namely an initial public offering (IPO), refers to the process by which an enterprise first offers its stock to investors through a stock exchange in order to raise funds for its development.
After new stocks come onto the market, the relevant organizations need to audit the qualification of the buyers, and only buyers who pass the audit are qualified to subscribe. The relevant organization needs to examine the relationship between the buyer and the issuing enterprise, the relationship between the buyer and the issuing agency, and the like, so as to determine the subscription qualification of the buyer.
For the investment relation certification documents provided by the buyers, the conventional method is to upload the audit data manually; the relevant staff then manually mark the relationship between the buyer and the issuing enterprise (or issuing agency), and after marking, the relationship is checked by the auditors. This is inefficient and error-prone, and the labor cost and time cost are high.
Disclosure of Invention
The application aims to provide a document processing method and device, an electronic device and a computer readable storage medium.
A first aspect of the present application provides a document processing method, including:
receiving a target document uploaded by a user, generating a document abstract of the target document through a preset automatic abstract tool, and extracting target keywords from the document abstract;
judging whether the target document meets a preset requirement according to the number of the extracted target keywords, wherein the preset requirement is that the number of the target keywords contained in the document is larger than a preset threshold value;
and if the target document meets the preset requirement, classifying and labeling the target document based on a pre-trained BERT model, and outputting the category and the label corresponding to the target document.
According to some embodiments of the application, if the target document does not meet the preset requirement, a prompt message is sent to prompt a user that the target document does not meet the requirement and needs to be uploaded again.
According to some embodiments of the present application, the classifying and labeling the target document based on the pre-trained BERT model includes:
inputting the target document into a pre-trained BERT model to obtain an issuing enterprise entity, an agent entity and a buyer entity in the target document;
finding associated paragraphs of the issuing enterprise entity and associated paragraphs of the agent entity in the target document according to the issuing enterprise entity and the agent entity;
extracting the investment relationship between the issuing enterprise entity and the buyer entity from the associated paragraphs of the issuing enterprise entity and extracting the investment relationship between the agent entity and the buyer entity from the associated paragraphs of the agent entity by utilizing a pre-trained semantic analysis model;
and classifying the investment relations and marking corresponding labels according to the corresponding relation between the preset investment relations and the preset categories.
According to some embodiments of the application, the training process of the BERT model comprises:
determining a training document and an initial BERT model;
inputting the training document into the initial BERT model;
acquiring the output of the initial BERT model to obtain training feature representation information corresponding to the training document;
determining the predicted category of the training document according to the training feature representation information;
determining the actual category of the training document, and obtaining feedback information according to the actual category and the predicted category;
and adjusting the model parameters of the initial BERT model according to the feedback information to obtain the BERT model.
According to some embodiments of the present application, after classifying the investment relations and labeling the investment relations, the method further comprises:
performing pedigree analysis on pre-acquired multi-party data including issuing enterprises, agents and buyers to perform authenticity check on the investment relation;
and updating the investment relation according to the authenticity checking result, classifying the updated investment relation and marking a corresponding label.
According to some embodiments of the application, the method further comprises:
generating an audit report according to a preset report template based on the category and the label corresponding to the target document.
According to some embodiments of the application, the method further comprises:
correspondingly storing the audit report and the target document in a database.
A second aspect of the present application provides a document processing apparatus comprising:
the receiving module is used for receiving a target document uploaded by a user, generating a document abstract of the target document through a preset automatic abstract tool, and extracting a target keyword from the document abstract;
the judging module is used for judging whether the target document meets a preset requirement according to the number of the extracted target keywords, wherein the preset requirement is that the number of the target keywords contained in the document is larger than a preset threshold value;
and the labeling module is used for classifying and labeling the target document based on a pre-trained BERT model and outputting the category and the label corresponding to the target document if the target document meets the preset requirement.
According to some embodiments of the application, the document processing apparatus further comprises:
and the prompting module is used for sending out prompting information to prompt a user that the target document does not meet the preset requirement and needs to be uploaded again if the target document does not meet the preset requirement.
In some implementations of embodiments of the present application, the labeling module is specifically configured to:
inputting the target document into a pre-trained BERT model to obtain an issuing enterprise entity, an agent entity and a buyer entity in the target document;
finding associated paragraphs of the issuing enterprise entity and associated paragraphs of the agent entity in the target document according to the issuing enterprise entity and the agent entity;
extracting the investment relationship between the issuing enterprise entity and the buyer entity from the associated paragraphs of the issuing enterprise entity and extracting the investment relationship between the agent entity and the buyer entity from the associated paragraphs of the agent entity by utilizing a pre-trained semantic analysis model;
and classifying the investment relations and marking corresponding labels according to the corresponding relation between the preset investment relations and the preset categories.
According to some embodiments of the application, the training process of the BERT model comprises:
determining a training document and an initial BERT model;
inputting the training document into the initial BERT model;
acquiring the output of the initial BERT model to obtain training feature representation information corresponding to the training document;
determining the predicted category of the training document according to the training feature representation information;
determining the actual category of the training document, and obtaining feedback information according to the actual category and the predicted category;
and adjusting the model parameters of the initial BERT model according to the feedback information to obtain the BERT model.
According to some embodiments of the application, the document processing apparatus further comprises:
the checking module is used for, after the labeling module classifies the investment relations and marks the corresponding labels, performing pedigree analysis on the pre-acquired multi-party data including the issuing enterprises, the agents and the buyers so as to perform authenticity checking on the investment relations; and updating the investment relation according to the authenticity checking result, classifying the updated investment relation and marking a corresponding label.
According to some embodiments of the application, the document processing apparatus further comprises:
and the report generation module is used for generating an audit report according to a preset report template based on the category and the label corresponding to the target document.
According to some embodiments of the application, the document processing apparatus further comprises:
and the storage module is used for correspondingly storing the audit report and the target document in a database.
A third aspect of the present application provides an electronic device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor performing the method of the first aspect of the application when executing the computer program.
A fourth aspect of the present application provides a computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of the first aspect of the present application.
Compared with the prior art, the document processing method, the document processing device, the electronic equipment and the storage medium provided by the application receive a target document uploaded by a user, generate a document abstract of the target document through a preset automatic abstract tool, and extract target keywords from the document abstract; judge whether the target document meets a preset requirement according to the number of the extracted target keywords, wherein the preset requirement is that the number of the target keywords contained in the document is larger than a preset threshold value; and, if the target document meets the preset requirement, classify and label the target document based on a pre-trained BERT model and output the category and the label corresponding to the target document. In this way, the relationship between a buyer and the issuing enterprise can be automatically classified and identified from the investment relation certification document uploaded by the buyer, without manual identification, which greatly reduces the consumption of manpower and time.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a first flowchart of a document processing method provided by the present application;
FIG. 2 illustrates a second flowchart of a document processing method provided by the present application;
FIG. 3 illustrates a third flowchart of a document processing method provided by the present application;
FIG. 4 illustrates a schematic diagram of a document processing device provided by some embodiments of the present application;
FIG. 5 illustrates a schematic diagram of an electronic device provided by some embodiments of the present application;
FIG. 6 illustrates a schematic diagram of a computer-readable storage medium provided by some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
In addition, the terms "first" and "second", etc. are used to distinguish different objects, rather than to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The embodiment of the application provides a document processing method and device, an electronic device and a computer readable storage medium, which are described below with reference to the accompanying drawings.
Referring to fig. 1, which shows a flowchart of a document processing method provided in some embodiments of the present application, as shown in fig. 1, the document processing method may include the following steps:
step S101: receiving a target document uploaded by a user, generating a document abstract of the target document through a preset automatic abstract tool, and extracting target keywords from the document abstract;
step S102: judging whether the target document meets a preset requirement according to the number of the extracted target keywords, wherein the preset requirement is that the number of the target keywords contained in the document is larger than a preset threshold value;
step S103: and if the target document meets the preset requirement, classifying and labeling the target document based on a pre-trained BERT model, and outputting the category and the label corresponding to the target document.
In practical applications, the user may be a buyer who subscribes for the stock of an issuing enterprise, the target document may be an investment relation certification document uploaded by the buyer, and the investment relation certification document is a document certifying the investment relation between the buyer and the issuing enterprise (or issuing agency).
Specifically, the investment relationship may include the following:
the buyer is an employee or a former employee of the issuing enterprise;
the buyer is a capital investor of the issuing enterprise;
the buyer is a stockholder of the issuing enterprise;
the buyer is a stockholder of the agent (or the sponsor).
After the target document uploaded by the user is received in step S101, the relevance of its content to the investment relationship needs to be determined. If the relevance is low, the target document is considered invalid, and the user may be prompted that the target document does not meet the requirement and needs to be uploaded again.
In step S101, the target keywords may be, for example, "employee", "shareholder", "agent", or the like. Whether the target document uploaded by the user meets the preset requirement may be determined according to whether the number of the target keywords contained in the document is greater than a preset threshold.
Specifically, determining whether the target document meets the preset requirement may be implemented as follows: in step S101, a document abstract of the target document is generated through a preset automatic abstract tool, and target keywords are extracted from the document abstract; in step S102, whether the target document meets the preset requirement is determined according to the number of the extracted target keywords.
If the target document meets the preset requirement, the relevance between the content of the target document and the investment relation is high, and if the target document does not meet the preset requirement, the relevance between the content of the target document and the investment relation is low.
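The keyword check of steps S101 and S102 can be sketched as follows. This is only an illustrative sketch: the keyword list, the threshold value, the choice to count every occurrence (rather than distinct keywords), and the `summarize` stand-in (which simply keeps the first sentences) are all assumptions, since the application does not name a particular abstract tool or threshold.

```python
import re

# Hypothetical keyword list and threshold, chosen only for illustration.
TARGET_KEYWORDS = ("employee", "shareholder", "agent")
PRESET_THRESHOLD = 2


def summarize(document: str, max_sentences: int = 3) -> str:
    """Stand-in for the preset automatic abstract tool: keep the first sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:max_sentences])


def extract_target_keywords(summary: str) -> list[str]:
    """Return every occurrence of a target keyword in the summary, in order."""
    pattern = r"\b(" + "|".join(map(re.escape, TARGET_KEYWORDS)) + r")s?\b"
    return [m.group(1).lower() for m in re.finditer(pattern, summary, flags=re.IGNORECASE)]


def meets_preset_requirement(document: str) -> bool:
    """The document passes when the keyword count exceeds the preset threshold."""
    keywords = extract_target_keywords(summarize(document))
    return len(keywords) > PRESET_THRESHOLD
```

A document for which `meets_preset_requirement` returns False would trigger the re-upload prompt; one for which it returns True would proceed to step S103.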
In step S103, if the target document meets the preset requirement, classifying and labeling the target document based on a pre-trained BERT model, and outputting a class and a label corresponding to the target document.
Specifically, a pre-trained BERT model may be used to perform reading comprehension and semantic analysis on the content of the target document. BERT stands for Bidirectional Encoder Representations from Transformers, i.e., a multi-layer bidirectional Transformer encoder.
BERT is a pre-trained language model. The aim of pre-training is to train in advance the middle and lower layers of the model, which capture what downstream tasks have in common, and then to train a task-specific model with the sample data of each downstream task (such as machine translation or reading comprehension), which can greatly speed up convergence. Sample data in this application may be multi-party data including issuing enterprises, agents and buyers, as well as publicly available related data.
According to some embodiments of the present application, the classifying and labeling operation performed on the target document based on the pre-trained BERT model in step S103 may be implemented as:
step S201: inputting the target document into a pre-trained BERT model to obtain an issuing enterprise entity, an agent entity and a buyer entity in the target document;
step S202: finding associated paragraphs of the issuing enterprise entity and associated paragraphs of the agent entity in the target document according to the issuing enterprise entity and the agent entity;
step S203: extracting the investment relationship between the issuing enterprise entity and the buyer entity from the associated paragraphs of the issuing enterprise entity and extracting the investment relationship between the agent entity and the buyer entity from the associated paragraphs of the agent entity by utilizing a pre-trained semantic analysis model;
step S204: and classifying the investment relations and marking corresponding labels according to the corresponding relation between the preset investment relations and the preset categories.
Specifically, machine reading comprehension and semantic analysis are performed on the investment relation certification document, and the entities in it are identified. The issuing enterprise, the agent and the buyer are found among the entities; the relevant paragraphs are then located for the identified issuing enterprise entity, the relationships between the entities are analyzed semantically, and each relationship is identified and labeled. If the relationship is an employee relationship, the document is labeled with an employee label; if it is a cornerstone investor relationship, the document is labeled with a cornerstone investor label; if it is a stockholder relationship, the document is labeled with a stockholder label; and so on. If no relationship is identified, the user is notified that the uploaded document may not meet the requirements, and is asked either to indicate which passages in the document describe the investment relationship or to supplement other documents.
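As a rough illustration of steps S202 and S204 only, the sketch below locates the paragraphs that mention an entity and maps extracted relations to labels. Entity recognition (S201) and relation extraction (S203) are model outputs and are taken as given inputs here. The relation names, the label strings, the blank-line paragraph convention and the prompt text are all hypothetical.

```python
# Hypothetical correspondence between investment relations and preset labels.
RELATION_TO_LABEL = {
    "employee": "employee",
    "investor": "investor",
    "shareholder": "shareholder",
}


def find_associated_paragraphs(document: str, entity: str) -> list[str]:
    """Step S202: return the blank-line separated paragraphs that mention the entity."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [p for p in paragraphs if entity in p]


def label_relations(relations: list[str]) -> list[str]:
    """Step S204: map each extracted relation to its preset label, skipping unknown ones."""
    return [RELATION_TO_LABEL[r] for r in relations if r in RELATION_TO_LABEL]


def classify(relations: list[str]) -> dict:
    """Return the labels, or a prompt for the user when no relation was identified."""
    labels = label_relations(relations)
    if not labels:
        prompt = "No investment relation identified; please indicate the relevant passages or supplement documents."
        return {"labels": [], "prompt": prompt}
    return {"labels": labels, "prompt": None}
```

Keeping the relation-to-label correspondence in a single table means that a new category only requires a new table entry, not a change to the labeling logic.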
In the present application, the training process of the BERT model is as follows:
determining a training document and an initial BERT model;
inputting the training document into the initial BERT model;
acquiring the output of the initial BERT model to obtain training feature representation information corresponding to the training document;
determining the predicted category of the training document according to the training feature representation information;
determining the actual category of the training document, and obtaining feedback information according to the actual category and the predicted category;
and adjusting the model parameters of the initial BERT model according to the feedback information to obtain the BERT model.
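The feedback loop described above (predict a category, compare it with the actual category, adjust the parameters) can be illustrated with a deliberately simplified stand-in. The sketch below does not use BERT: it assumes each training document has already been reduced to a fixed feature vector, and it trains only a linear softmax classifier with cross-entropy feedback. The function names, learning rate and epoch count are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_classifier(features, labels, num_classes, epochs=200, lr=0.5):
    """Fit a linear softmax head by gradient descent on cross-entropy loss."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, actual in zip(features, labels):
            scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
            probs = softmax(scores)  # predicted category distribution
            for c in range(num_classes):
                # Feedback: predicted probability minus the one-hot actual category.
                error = probs[c] - (1.0 if c == actual else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * error * x[j]
    return weights


def predict(weights, x):
    """Return the index of the highest-scoring category."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)
```

In an actual fine-tuning setup the same feedback would be propagated back through the BERT layers as well, rather than stopping at the classification head.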
In this application, the semantic analysis model is obtained by training a related model, which is not described in detail here.
The document processing method provided by the embodiment of the application receives a target document uploaded by a user and judges whether the target document meets a preset requirement, wherein the preset requirement is that the number of target keywords contained in the document is greater than a preset threshold value; and if the target document meets the preset requirement, classifies and labels the target document based on a pre-trained BERT model, and outputs the category and the label corresponding to the target document. Compared with the prior art, the relationship between a buyer and the issuing enterprise can be automatically classified and identified from the investment relation certification document uploaded by the buyer, without manual identification, which greatly reduces the consumption of manpower and time.
According to some embodiments of the application, if the target document does not meet the preset requirement, a prompt message is sent to prompt a user that the target document does not meet the requirement and needs to be uploaded again.
According to some embodiments of the present application, as shown in fig. 2, the document processing method of the above embodiment may further include:
step S301: performing pedigree analysis on pre-acquired multi-party data including issuing enterprises, agents and buyers to perform authenticity check on the investment relation;
step S302: and updating the investment relation according to the authenticity checking result, classifying the updated investment relation and marking a corresponding label.
Specifically, take the identification of a shareholding relationship as an example: buyer A is a stockholder of company B, and company B in turn holds stock in the issuing enterprise, so buyer A is in fact an indirect stockholder of the issuing enterprise. Such a relationship can only be identified by tracing back through public data and performing pedigree analysis. If the documents uploaded by the buyer are incomplete, some investment relations may have been intentionally omitted, so the authenticity of the investment relations needs to be checked against public data or data provided by the issuing enterprise, to ensure that the investment relations of the buyer are comprehensively identified.
Specifically, the pedigree analysis is exemplified as follows:
a. Using public data, analyze the shareholders of the issuing enterprise and their types. Shareholders of individual type are recorded in the data table person_inv, and shareholders of enterprise type are recorded in the data table etp_inv (the table contains the fields: enterprise name, checked or not).
b. Perform the analysis of step a on each enterprise recorded in etp_inv, continuing to analyze its shareholders and their types. Individual shareholders are recorded in person_inv, and enterprise shareholders are recorded in etp_inv. After an enterprise has been checked, it is marked as checked (record status) in etp_inv.
c. Traverse etp_inv and check the shareholders of each enterprise shareholder in turn, until all remaining shareholders are individuals.
d. Then match the individual shareholders in person_inv against the buyers one by one, by name or identity card number, and determine whether a match exists. If a match exists, determine whether the investment relationships already identified for the buyer include the corresponding investment relationship; if not, prompt the buyer to supplement and submit further investment relationship certification documents; otherwise, the buyer does not pass.
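Steps a to d amount to a breadth-first walk up the ownership chain. The following is a minimal sketch under stated assumptions: the `shareholders_of` mapping is a hypothetical in-memory stand-in for the public shareholder data, the two tables are modeled as a set and a dict rather than database tables, and matching is done by name only.

```python
from collections import deque


def collect_individual_shareholders(issuer, shareholders_of):
    """Steps a-c: walk the ownership chain upward from the issuing enterprise.

    shareholders_of maps an enterprise name to a list of (name, kind) pairs,
    where kind is "person" or "enterprise" (a hypothetical data layout).
    """
    person_inv = set()  # individual shareholders found so far
    etp_inv = {}        # enterprise shareholder name -> checked flag
    queue = deque([issuer])
    while queue:
        enterprise = queue.popleft()
        for name, kind in shareholders_of.get(enterprise, []):
            if kind == "person":
                person_inv.add(name)
            elif name not in etp_inv:
                etp_inv[name] = False  # recorded, not yet checked
                queue.append(name)
        if enterprise in etp_inv:
            etp_inv[enterprise] = True  # record status: checked
    return person_inv, etp_inv


def is_indirect_shareholder(buyer, issuer, shareholders_of):
    """Step d: match the buyer against the collected individual shareholders."""
    person_inv, _ = collect_individual_shareholders(issuer, shareholders_of)
    return buyer in person_inv
```

Because each enterprise is added to `etp_inv` only once, the walk terminates even when the shareholding data contains cross-holdings.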
In this embodiment, the authenticity check of the investment relationship is carried out based on equity information, financial data and other information provided by the issuing enterprise, the agency and the buyer, together with publicly available data, so that special relationships between a new-stock buyer and the issuing enterprise or the agency are uncovered and correctly identified to meet regulatory requirements.
According to some embodiments of the present application, as shown in fig. 3, the document processing method of the above embodiment may further include:
step S303: and generating an audit report according to a preset report template based on the category and the label corresponding to the target document.
Specifically, based on the obtained investment relationship between the buyer and the listed enterprise (or the agent) and the authenticity check result, an audit report is generated according to a preset report template.
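Filling a preset report template can be sketched with the standard-library `string.Template`. The template fields and wording below are hypothetical; the application does not specify the layout of the audit report.

```python
from string import Template

# Hypothetical report template; the real layout is not given in the application.
REPORT_TEMPLATE = Template(
    "Audit report\n"
    "Buyer: $buyer\n"
    "Issuing enterprise: $issuer\n"
    "Category: $category\n"
    "Labels: $labels\n"
    "Authenticity check: $check_result\n"
)


def generate_audit_report(buyer, issuer, category, labels, check_passed):
    """Fill the preset template with the classification and check results."""
    return REPORT_TEMPLATE.substitute(
        buyer=buyer,
        issuer=issuer,
        category=category,
        labels=", ".join(labels),
        check_result="passed" if check_passed else "failed",
    )
```

`substitute` raises `KeyError` when a field is missing, so an incomplete report fails loudly instead of being produced with blanks.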
According to some embodiments of the present application, the document processing method of the above embodiment may further include: and correspondingly storing the audit report and the target document in a database for subsequent use.
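Storing the audit report together with its target document can be sketched with the standard-library `sqlite3` module. The table name, column names and the choice of SQLite are assumptions made for illustration; the application only states that the two are stored correspondingly in a database.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one row per document/report pair."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_records ("
        "id INTEGER PRIMARY KEY, document TEXT NOT NULL, report TEXT NOT NULL)"
    )
    return conn


def save_record(conn, document, report):
    """Insert the document and its audit report in one row and return the row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO audit_records (document, report) VALUES (?, ?)",
            (document, report),
        )
    return cur.lastrowid


def load_record(conn, record_id):
    """Return the (document, report) pair for the id, or None if it does not exist."""
    return conn.execute(
        "SELECT document, report FROM audit_records WHERE id = ?", (record_id,)
    ).fetchone()
```

Keeping both values in the same row is one simple way to guarantee that a report can always be traced back to the document it was generated from.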
According to the embodiment, the automatic classification identification is carried out on the relationship between the applicant and the issuing enterprise according to the investment relationship certification document uploaded by the applicant, manual identification of a user is not needed, the consumption of manpower and time is greatly reduced, and an audit report of the applicant is generated for an auditor to audit, so that the audit efficiency is improved.
The foregoing embodiments provide a document processing method; correspondingly, the present application also provides a document processing device. The document processing apparatus provided in the embodiment of the present application may implement the document processing method, and may be implemented by software, hardware, or a combination of software and hardware. For example, the document processing device may comprise integrated or separate functional modules or units to perform the corresponding steps of the above-described methods. Referring to fig. 4, a schematic diagram of a document processing apparatus according to some embodiments of the present application is shown. Since the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the corresponding descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
As shown in fig. 4, the document processing apparatus 10 may include:
the receiving module 101 is configured to receive a target document uploaded by a user, generate a document abstract of the target document through a preset automatic abstract tool, and extract a target keyword from the document abstract;
the judging module 102 is configured to judge whether the target document meets a preset requirement according to the number of the extracted target keywords, where the preset requirement is that the number of the target keywords included in the document is greater than a preset threshold;
and the labeling module 103 is configured to, if the target document meets a preset requirement, perform classification and labeling operations on the target document based on a pre-trained BERT model, and output a category and a label corresponding to the target document.
In practical applications, the user may be a claimant subscribing for shares issued by an issuing enterprise, and the target document may be an investment relationship certification document uploaded by the claimant, that is, a document certifying the investment relationship between the claimant and the issuing enterprise (or the issuing agent).
After the target document uploaded by the user is received, the correlation between its content and the investment relationship is assessed. If the correlation is very low, the uploaded target document is considered invalid, and the user is prompted that the target document does not meet the requirement and needs to be uploaded again.
Specifically, the target keywords may be, for example, "employee", "shareholder" or "agent"; whether the target document uploaded by the user meets the preset requirement may be determined according to whether the number of target keywords it contains is greater than the preset threshold.
If the target document meets the preset requirement, the relevance between the content of the target document and the investment relation is high, and if the target document does not meet the preset requirement, the relevance between the content of the target document and the investment relation is low.
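The threshold check performed by the judging module can be sketched as below. The keyword list follows the examples given in the text, while the threshold value of 2, the choice to count every occurrence rather than distinct keywords, and the helper names are assumptions for illustration; the summary is taken as already produced by the automatic summarization tool.

```python
import re

TARGET_KEYWORDS = ("employee", "shareholder", "agent")  # examples named in the text
PRESET_THRESHOLD = 2  # hypothetical value; the source does not fix a number


def count_target_keywords(summary, keywords=TARGET_KEYWORDS):
    """Count whole-word, case-insensitive occurrences of the target keywords."""
    text = summary.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(keyword) + r"\b", text))
        for keyword in keywords
    )


def meets_preset_requirement(summary, threshold=PRESET_THRESHOLD):
    """The document qualifies when the keyword count exceeds the threshold."""
    return count_target_keywords(summary) > threshold


relevant = (
    "The claimant is an employee and a shareholder of the issuing enterprise; "
    "the agent confirms the shareholder register."
)
irrelevant = "Quarterly weather report."
```

With these inputs, `meets_preset_requirement(relevant)` is true and `meets_preset_requirement(irrelevant)` is false, in which case the user would be prompted to upload the document again.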
In some implementations of embodiments of the present application, the document processing apparatus 10 further includes:
the prompting module is configured to, if the target document does not meet the preset requirement, issue a prompt informing the user that the target document does not meet the preset requirement and needs to be uploaded again.
In some implementations of the embodiments of the present application, the labeling module 103 is specifically configured to:
inputting the target document into a pre-trained BERT model to obtain an issuing enterprise entity, an agent entity and a buyer entity in the target document;
finding associated paragraphs of the issuing enterprise entity and associated paragraphs of the agent entity in the target document according to the issuing enterprise entity and the agent entity;
extracting the investment relationship between the issuing enterprise entity and the buyer entity from the association paragraph of the issuing enterprise entity and extracting the investment relationship between the agent entity and the buyer entity from the association paragraph of the agent entity by utilizing a pre-trained semantic analysis model;
and classifying the investment relations and marking corresponding labels according to the corresponding relation between the preset investment relations and the preset categories.
In some implementations of embodiments of the present application, the investment relationship includes:
the claimant is a current or former employee of the issuing enterprise;
the claimant is a cornerstone investor of the issuing enterprise;
the claimant is a shareholder of the issuing enterprise;
the claimant is a stakeholder of the agent.
Specifically, machine reading comprehension and semantic analysis are performed on the investment relationship certification document to identify entities. The issuing enterprise, the agent and the buyer (claimant) are found among the identified entities, and the paragraphs associated with the identified issuing enterprise entity are then located. The relationships between the entities are analyzed semantically, identified and labeled: if the relationship is an employee relationship, the document is given an employee label; if it is a cornerstone investor relationship, a cornerstone investor label; if it is a shareholder relationship, a shareholder label; and so on. If no relationship is identified, the user is informed that the uploaded document may not meet the requirements, and is asked either to indicate which passages in the document demonstrate the investment relationship or to upload supplementary documents.
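The paragraph lookup and the final labeling step can be sketched as follows. Entity recognition with the BERT model and relationship extraction with the semantic analysis model are deliberately left out: the example takes their outputs as given. The function names, the relation keys and the prompt wording are hypothetical.

```python
def find_associated_paragraphs(document, entity):
    """Return the paragraphs (separated by blank lines) that mention the entity."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [p for p in paragraphs if entity in p]


# Hypothetical mapping from an extracted relationship to the document label.
RELATION_TO_LABEL = {
    "employee": "employee",
    "cornerstone_investor": "cornerstone investor",  # "keystone investor" in some translations
    "shareholder": "shareholder",
}


def label_document(relations):
    """Map extracted relationships to labels, or return a prompt if none is found."""
    labels = [RELATION_TO_LABEL[r] for r in relations if r in RELATION_TO_LABEL]
    if not labels:
        prompt = (
            "No investment relationship was identified; please indicate the "
            "relevant passages or upload supplementary documents."
        )
        return [], prompt
    return labels, None


document = "A holds shares in Issuer Co.\n\nUnrelated text.\n\nIssuer Co. employs A."
paragraphs = find_associated_paragraphs(document, "Issuer Co.")
labels, prompt = label_document(["employee", "shareholder"])
```

Keeping the relationship-to-label mapping in a table makes the preset correspondence between investment relationships and categories easy to extend.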
In some implementations of embodiments of the present application, the document processing apparatus 10 further includes:
the checking module is configured to, after the labeling module classifies the investment relationships and labels them accordingly, perform pedigree analysis on pre-acquired multi-party data covering the issuing enterprise, the agent and the claimant so as to check the authenticity of the investment relationships; and to update the investment relationships according to the authenticity check result, classify the updated investment relationships and label them accordingly.
Specifically, take the identification of a shareholding relationship as an example: claimant A is a shareholder of company B, and company B in turn holds shares in the issuing enterprise, so claimant A is in effect also a shareholder of the issuing enterprise. Such a relationship can be identified only through retrospective queries of public data and pedigree analysis. If the documents uploaded by a claimant are incomplete, some investment relationships may have been omitted, possibly intentionally, so the authenticity of the investment relationships needs to be checked against public data or data provided by the issuing enterprise to ensure that the claimant's investment relationships are identified comprehensively.
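The indirect shareholding in this example can be found with a graph search over direct holdings. The sketch below uses breadth-first search as one plausible reading of "pedigree analysis"; the `holdings` dictionary, the entity names and the `find_holding_path` helper are assumptions, and the application does not state which algorithm is actually used.

```python
from collections import deque


def find_holding_path(holdings, holder, target):
    """Breadth-first search for a chain of shareholdings from holder to target.

    `holdings` maps each holder to the entities it directly holds shares in.
    Returns the chain as a list of names, or None if no chain exists.
    """
    queue = deque([[holder]])
    visited = {holder}
    while queue:
        path = queue.popleft()
        for held in holdings.get(path[-1], ()):
            if held == target:
                return path + [held]
            if held not in visited:  # guards against circular holdings
                visited.add(held)
                queue.append(path + [held])
    return None


# Hypothetical direct holdings assembled from public data.
holdings = {
    "Claimant A": ["Company B"],
    "Company B": ["Issuing Enterprise"],
}
path = find_holding_path(holdings, "Claimant A", "Issuing Enterprise")
```

Here `path` is the chain from Claimant A through Company B to the issuing enterprise, which is the relationship a claimant's own documents might omit.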
In the embodiment, the authenticity check of the investment relationship is carried out based on information such as equity information, financial data and the like provided by the issuing enterprise, the agency and the claimant and publicly known data, and the special relationship between the new stock claimant and the issuing enterprise and the agency is mined and correctly identified so as to meet the supervision requirement.
In some implementations of embodiments of the present application, the document processing apparatus 10 further includes:
and the report generation module is used for generating an audit report according to a preset report template based on the category and the label corresponding to the target document.
Specifically, based on the obtained investment relationship between the claimant and the issuing enterprise (or the agent) and the authenticity check result, an audit report is generated according to a preset report template.
In this embodiment, the relationship between the claimant and the issuing enterprise is automatically classified and identified from the investment relationship certification document uploaded by the claimant, without manual identification, which greatly reduces the labor and time required; an audit report on the claimant is also generated for an auditor to review, which improves audit efficiency.
In some implementations of embodiments of the present application, the document processing apparatus 10 further includes:
and the storage module is used for correspondingly storing the audit report and the target document in a database.
The document processing device provided by the embodiment of the application receives a target document uploaded by a user and judges whether the target document meets a preset requirement, wherein the preset requirement is that the number of target keywords contained in the document is greater than a preset threshold value; and if the target document meets the preset requirement, classifying and labeling the target document based on a pre-trained BERT model, and outputting the category and the label corresponding to the target document. Compared with the prior art, the device can automatically classify and identify the relationship between the claimant and the issuing enterprise according to the investment relationship certification document uploaded by the claimant, does not need manual identification of a user, and greatly reduces the consumption of manpower and time. And moreover, authenticity check can be automatically carried out on the investment relation, and an audit report is generated, so that the audit efficiency is improved.
The embodiment of the present application further provides an electronic device corresponding to the document processing method provided in the foregoing embodiment, where the electronic device may be a mobile phone, a notebook computer, a tablet computer, a desktop computer, or the like, so as to execute the document processing method.
Please refer to fig. 5, which illustrates a schematic diagram of an electronic device according to some embodiments of the present application. As shown in fig. 5, the electronic device 20 includes a processor 200, a memory 201, a bus 202 and a communication interface 203, wherein the processor 200, the communication interface 203 and the memory 201 are connected through the bus 202; the memory 201 stores a computer program executable on the processor 200, and the processor 200 performs the document processing method provided by any one of the foregoing embodiments when executing the computer program.
The Memory 201 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 203 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.
The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 200 or by instructions in the form of software. The processor 200 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, or a register. The storage medium is located in the memory 201; the processor 200 reads the information in the memory 201 and completes the steps of the method in combination with its hardware.
The electronic device provided by the embodiment of the present application is based on the same inventive concept as the document processing method provided by the embodiment of the present application, and has the same beneficial effects as the method it adopts, runs or implements.
Referring to fig. 6, the computer-readable storage medium is shown as an optical disc 30 on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the document processing method provided in any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above embodiment of the present application is based on the same inventive concept as the document processing method provided by the embodiment of the present application, and has the same beneficial effects as the method adopted, run or implemented by the application program stored on it.
It should be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present disclosure, and the present disclosure should be construed as being covered by the claims and the specification.
Claims (10)
1. A method of document processing, comprising:
receiving a target document uploaded by a user, generating a document abstract of the target document through a preset automatic abstract tool, and extracting target keywords from the document abstract;
judging whether the target document meets a preset requirement according to the number of the extracted target keywords, wherein the preset requirement is that the number of the target keywords contained in the document is larger than a preset threshold value;
and if the target document meets the preset requirement, classifying and labeling the target document based on a pre-trained BERT model, and outputting the category and the label corresponding to the target document.
2. The method of claim 1, further comprising:
and if the target document does not meet the preset requirement, sending a prompt message to prompt a user that the target document does not meet the requirement and needs to be uploaded again.
3. The method of claim 1, wherein the classifying and labeling the target document based on a pre-trained BERT model comprises:
inputting the target document into a pre-trained BERT model to obtain an issuing enterprise entity, an agent entity and a buyer entity in the target document;
finding associated paragraphs of the issuing enterprise entity and associated paragraphs of the agent entity in the target document according to the issuing enterprise entity and the agent entity;
extracting the investment relationship between the issuing enterprise entity and the buyer entity from the association paragraph of the issuing enterprise entity and extracting the investment relationship between the agent entity and the buyer entity from the association paragraph of the agent entity by utilizing a pre-trained semantic analysis model;
and classifying the investment relations and marking corresponding labels according to the corresponding relation between the preset investment relations and the preset categories.
4. The method of claim 3, wherein the training process of the BERT model comprises:
determining a training document and an initial BERT model;
inputting the training document into the initial BERT model;
acquiring the output of the initial BERT model to obtain training feature representation information corresponding to the training document;
determining the prediction category of the training document according to the training feature representation information;
determining the actual category of the training document, and obtaining feedback information according to the actual category and the prediction category;
and adjusting the model parameters of the initial BERT model according to the feedback information to obtain the BERT model.
5. The method of claim 3, wherein after classifying the investment relationships and labeling the investment relationships accordingly, further comprising:
performing pedigree analysis on pre-acquired multi-party data including issuing enterprises, agents and buyers to perform authenticity check on the investment relation;
and updating the investment relation according to the authenticity checking result, classifying the updated investment relation and marking a corresponding label.
6. The method according to any one of claims 1 to 4, further comprising:
and generating an audit report according to a preset report template based on the category and the label corresponding to the target document.
7. The method of claim 6, further comprising:
and correspondingly storing the audit report and the target document in a database.
8. A document processing apparatus, comprising:
the receiving module is used for receiving a target document uploaded by a user, generating a document abstract of the target document through a preset automatic abstract tool, and extracting a target keyword from the document abstract;
the judging module is used for judging whether the target document meets a preset requirement according to the number of the extracted target keywords, wherein the preset requirement is that the number of the target keywords contained in the document is larger than a preset threshold value;
and the labeling module is used for classifying and labeling the target document based on a pre-trained BERT model and outputting the category and the label corresponding to the target document if the target document meets the preset requirement.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to implement the method according to any of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor to implement the method of any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011583169.6A CN112732897A (en) | 2020-12-28 | 2020-12-28 | Document processing method and device, electronic equipment and storage medium |
PCT/CN2021/096932 WO2022142116A1 (en) | 2020-12-28 | 2021-05-28 | Method and apparatus for processing document, and electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112732897A (en) | 2021-04-30 |
Family ID: 75606820
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011583169.6A Pending CN112732897A (en) | 2020-12-28 | 2020-12-28 | Document processing method and device, electronic equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112732897A (en) |
WO (1) | WO2022142116A1 (en) |
Also Published As
Publication number | Publication date |
---|---|
WO2022142116A1 (en) | 2022-07-07 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210430 |