
WO2022041714A1 - Document processing method and apparatus, electronic device, storage medium, and program - Google Patents


Info

Publication number
WO2022041714A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
feature
processed
information
file package
Prior art date
Application number
PCT/CN2021/083679
Other languages
French (fr)
Chinese (zh)
Inventor
陈嘉航
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2022041714A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/445: Program loading or initiating
    • G06F9/44505: Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451: User profiles; Roaming
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/602: Providing cryptographic facilities or services
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/30: Creation or generation of source code
    • G06F8/31: Programming languages or programming paradigms
    • G06F8/315: Object-oriented languages
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00: Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21: Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2107: File encryption

Definitions

  • the present application relates to the field of document management of financial technology (Fintech), and relates to, but is not limited to, a document processing method, apparatus, electronic device, computer-readable storage medium and computer program.
  • Embodiments of the present application provide a document processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program.
  • the embodiment of the present application provides a document processing method, the method includes:
  • the configuration file includes the identifier of the target feature of the document to be processed and the path information of the file package provided by the third-party platform;
  • the file package includes first information that characterizes the feature extraction method of the target feature;
  • the target feature is extracted from the document to be processed.
  • the file package includes a custom class, and the first information is located in the custom class;
  • the method further includes: loading the custom class in the file package through a reflection mechanism of a programming language, and acquiring the first information from the loaded custom class.
  • the custom class in the file package can be loaded through the reflection mechanism of the programming language; that is, regardless of whether the custom class in the file package is known in advance, the reflection mechanism allows the custom class to be loaded without importing it beforehand; when file packages are received in real time, the custom classes in them can therefore be loaded dynamically.
  • the configuration file further includes second information, where the second information includes: an identifier of the file package and/or an identifier of the custom class;
  • the loading of the custom class in the file package through the reflection mechanism of the programming language includes:
  • the custom class in the file package is loaded through the reflection mechanism of the programming language.
  • when the second information in the configuration file is the information agreed with the third-party platform in advance, the second information in the configuration file is correct; loading the custom class on this basis helps to obtain the first information accurately from the custom class and, in turn, to extract the target feature accurately.
  • the method further includes:
  • after the configuration file sent by the third-party platform is received, the encrypted information can be decrypted based on the decryption method corresponding to the preset encryption method. Encrypted transmission of the second information can therefore be realized, which helps to improve the security of the second information and reduces the risk of the second information being attacked.
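The encrypted transmission of the second information can be sketched as follows. This is only an illustration under stated assumptions: the source does not name the encryption method, so AES-GCM with a pre-shared 16-byte key is assumed here, with the random IV prepended to the ciphertext and the result Base64-encoded. The class name `SecondInfoCipher` and the sample payload are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class SecondInfoCipher {
    private static final int IV_BYTES = 12;   // recommended IV size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    /** Third-party side: encrypts the second information with the agreed method (assumed AES-GCM). */
    static String encrypt(String secondInfo, byte[] key) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new GCMParameterSpec(TAG_BITS, iv));
            byte[] ciphertext = cipher.doFinal(secondInfo.getBytes(StandardCharsets.UTF_8));
            // Prepend the IV so the receiver can recover it.
            byte[] out = new byte[iv.length + ciphertext.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    /** Device side: decrypts the encrypted information found in the configuration file. */
    static String decrypt(String encryptedInfo, byte[] key) {
        try {
            byte[] in = Base64.getDecoder().decode(encryptedInfo);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            // Also raised when the ciphertext was tampered with (tag mismatch).
            throw new IllegalStateException("decryption failed", e);
        }
    }
}
```

Because GCM is an authenticated mode, a modified ciphertext fails decryption instead of yielding altered identifiers, which fits the stated goal of reducing the risk of the second information being attacked.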
  • the document processing method further includes:
  • the obtaining the first information from the loaded custom class includes:
  • the custom class is instantiated as an object, and when the object belongs to the abstract class, the first information is obtained from the loaded custom class.
  • when the object instantiated from the custom class belongs to the abstract class, the custom class can be considered the correct class; on this basis, the first information can be obtained accurately from the custom class, which in turn facilitates accurate extraction of the target feature.
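The abstract-class check can be sketched as follows. The abstract class `FeatureExtractor`, its `extract` method, and the sample custom class `WordCountExtractor` are hypothetical names introduced for illustration; the source only states that an abstract class is predetermined and that the custom class inherits from it.

```java
final class ExtractorLoader {
    /** Hypothetical abstract class agreed in advance; custom classes must extend it. */
    public abstract static class FeatureExtractor {
        public abstract double extract(String document);
    }

    /** Sample custom class, standing in for one shipped in a third-party file package. */
    public static class WordCountExtractor extends FeatureExtractor {
        public WordCountExtractor() {}

        @Override
        public double extract(String document) {
            String trimmed = document.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /** Instantiates the custom class and accepts it only if the object belongs to the abstract class. */
    public static FeatureExtractor instantiate(Class<?> customClass) {
        try {
            Object object = customClass.getDeclaredConstructor().newInstance();
            if (!(object instanceof FeatureExtractor)) {
                throw new IllegalArgumentException(
                        customClass.getName() + " does not extend FeatureExtractor");
            }
            return (FeatureExtractor) object;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + customClass.getName(), e);
        }
    }
}
```

A class that does not inherit from the agreed abstract class, such as `String`, is rejected before any of its methods are called.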
  • the method further includes:
  • the target feature is extracted from the document to be processed based on a predetermined extraction method of the default feature.
  • the embodiment of the present application does not need to obtain the target feature extraction method from the third-party platform, but can realize the target feature extraction based on the predetermined default feature extraction method, which is easy to implement.
  • the method further includes:
  • a quality score is performed on the document to be processed based on the target feature, and a quality score value of the document to be processed is obtained.
  • the embodiments of the present application can implement the quality assessment of the document to be processed on the basis of the target feature, which is beneficial to realize the management of the document to be processed on the basis of the quality assessment of the document to be processed.
  • the target feature includes at least two features; the configuration file includes weight information of each of the at least two features;
  • Performing a quality score on the document to be processed based on the target feature to obtain a quality score value of the document to be processed including:
  • a weighted sum operation is performed on each of the at least two features to obtain a quality score value of the document to be processed.
  • the embodiment of the present application can implement the quality assessment of the document to be processed by performing weighted summation of each feature of the target feature, which is beneficial to realize the management of the document to be processed based on the quality assessment of the document to be processed.
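As a minimal sketch of the weighted summation, assuming the feature values and the weights from the configuration file are both keyed by feature identifier (the class name `QualityScorer` and the sample identifiers are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class QualityScorer {
    /** Weighted sum of feature values; every feature must have a weight in the configuration. */
    static double score(Map<String, Double> features, Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Double> feature : features.entrySet()) {
            Double weight = weights.get(feature.getKey());
            if (weight == null) {
                throw new IllegalArgumentException("no weight for feature: " + feature.getKey());
            }
            total += weight * feature.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> features = new LinkedHashMap<>();
        features.put("length", 0.5);
        features.put("template", 0.25);
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("length", 0.5);
        weights.put("template", 2.0);
        // 0.5 * 0.5 + 2.0 * 0.25 = 0.75
        System.out.println(score(features, weights));
    }
}
```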
  • the extracting the target feature from the document to be processed includes:
  • the word count of the document to be processed is discretized according to a plurality of predetermined word count intervals to obtain the length-related feature, where each word count interval corresponds to a value; the document feature vector of the document to be processed is extracted, and the cosine similarity between the document feature vector of the document to be processed and the document feature vector of a preset template is used as the template-related feature; the part-of-speech-related feature is determined according to the ratio of the number of words of a preset part of speech in the document to be processed to the number of all words in the document to be processed;
  • At least two of length-related features, template-related features, and part-of-speech-related features are used as the target features.
  • the embodiments of the present application can implement the quality evaluation of the document to be processed based on the length-related features, template-related features, and part-of-speech features, that is, the quality of the to-be-processed document can be accurately evaluated from multiple aspects.
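The three features above can be sketched as follows. The interval bounds, interval values, and part-of-speech tags are illustrative assumptions, and the document feature vectors are taken as given because the source does not say how they are computed; the class name `TemplateFeatures` is hypothetical.

```java
import java.util.List;
import java.util.Set;

final class TemplateFeatures {
    /**
     * Length-related feature: returns the value of the interval that wordCount falls in.
     * upperBounds is ascending; values has one more entry than upperBounds (the last
     * value covers everything above the final bound).
     */
    static double lengthFeature(int wordCount, int[] upperBounds, double[] values) {
        if (values.length != upperBounds.length + 1) {
            throw new IllegalArgumentException("values must have upperBounds.length + 1 entries");
        }
        for (int i = 0; i < upperBounds.length; i++) {
            if (wordCount <= upperBounds[i]) {
                return values[i];
            }
        }
        return values[values.length - 1];
    }

    /** Template-related feature: cosine similarity of two document feature vectors. */
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // similarity is undefined for a zero vector; treated as 0 here
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Part-of-speech-related feature: share of words whose tag is one of the preset tags. */
    static double partOfSpeechRatio(List<String> tags, Set<String> presetTags) {
        if (tags.isEmpty()) {
            return 0.0;
        }
        long hits = tags.stream().filter(presetTags::contains).count();
        return (double) hits / tags.size();
    }
}
```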
  • the extracting the target feature from the document to be processed includes:
  • the word count of the document to be processed is discretized according to a plurality of predetermined word count intervals to obtain the first feature, where each word count interval corresponds to a value;
  • the average sentence length of the document to be processed is discretized according to a plurality of predetermined sentence length intervals to obtain the second feature, where each sentence length interval corresponds to a value;
  • the number of errors in the document to be processed is used as the independent variable of an exponential function, and the value of the exponential function is used as the third feature;
  • the number of advanced words in the document to be processed is discretized according to a plurality of predetermined advanced word count intervals to obtain the fourth feature, where each advanced word count interval corresponds to a value, and an advanced word is a word located in a predetermined advanced vocabulary;
  • At least two of the first feature, the second feature, the third feature, and the fourth feature are used as the target feature.
  • the embodiments of the present application can evaluate the quality of the document to be processed based on the first feature, the second feature, the third feature, and the fourth feature; because these are four different features, the quality of the document to be processed can be evaluated accurately from multiple aspects.
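A sketch of the interval lookup, the exponential error feature, and the advanced-word count follows. The source does not give the form of the exponential function, so `exp(-decay * errorCount)` is an assumption chosen so that more errors give a lower value; the class name `WritingFeatures`, the interval table, and the sample vocabulary are likewise hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class WritingFeatures {
    /** Returns the value of the interval containing x; map keys are interval lower bounds. */
    static double discretize(double x, TreeMap<Double, Double> valueByLowerBound) {
        Map.Entry<Double, Double> interval = valueByLowerBound.floorEntry(x);
        if (interval == null) {
            throw new IllegalArgumentException("value below the lowest interval: " + x);
        }
        return interval.getValue();
    }

    /** Third feature, under the assumed form exp(-decay * errorCount). */
    static double errorFeature(int errorCount, double decay) {
        return Math.exp(-decay * errorCount);
    }

    /** Counts the words that appear in the predetermined advanced vocabulary (lower-case entries). */
    static long advancedWordCount(List<String> words, Set<String> advancedVocabulary) {
        return words.stream()
                .map(String::toLowerCase)
                .filter(advancedVocabulary::contains)
                .count();
    }
}
```

The first, second, and fourth features would each pass their raw statistic (word count, average sentence length, advanced-word count) through `discretize` with their own interval table.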
  • An embodiment of the present application provides a document processing device, and the device includes:
  • the first obtaining module is configured to obtain the document to be processed
  • a receiving module configured to receive a configuration file sent by a third-party platform, where the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform, and the file package includes first information that characterizes the feature extraction method of the target feature;
  • a second acquiring module configured to acquire the file package based on the path information of the file package when the identifier of the target feature is different from the identifier of the default feature
  • a processing module configured to extract the target feature from the document to be processed based on the first information in the file package.
  • the file package includes a custom class, and the first information is located in the custom class;
  • the second obtaining module is further configured to load the custom class in the file package through the reflection mechanism of the programming language, and obtain the first information from the loaded custom class.
  • the configuration file further includes second information, where the second information includes: an identifier of the file package and/or an identifier of the custom class;
  • the second acquisition module is configured to load the custom class in the file package through the reflection mechanism of the programming language, including:
  • the custom class in the file package is loaded through the reflection mechanism of the programming language.
  • the second obtaining module is further configured to obtain a preset encryption method of the second information; based on the decryption method corresponding to the encryption method of the second information, the configuration The encrypted information in the file is decrypted to obtain the second information; wherein the encrypted information is obtained by encrypting the second information based on the encryption method.
  • the second obtaining module is further configured to predetermine an abstract class, and set the custom class to inherit the predetermined abstract class;
  • the second obtaining module is configured to obtain the first information from the loaded custom class, including:
  • the custom class is instantiated as an object, and when the object belongs to the abstract class, the first information is obtained from the loaded custom class.
  • the processing module is further configured to, in the case that the identifier of the target feature is the same as the identifier of the default feature, extract the target feature from the document to be processed based on the predetermined extraction method of the default feature.
  • the processing module is further configured to perform a quality score on the document to be processed based on the target feature, and obtain a quality score value of the document to be processed.
  • the target feature includes at least two features; the configuration file includes weight information of each of the at least two features;
  • the processing module is configured to perform a quality score on the document to be processed based on the target feature, and obtain a quality score value of the document to be processed, including:
  • a weighted sum operation is performed on each of the at least two features to obtain a quality score value of the document to be processed.
  • the processing module configured to extract the target feature from the document to be processed, includes:
  • the word count of the document to be processed is discretized according to a plurality of predetermined word count intervals to obtain the length-related feature, where each word count interval corresponds to a value; the document feature vector of the document to be processed is extracted, and the cosine similarity between the document feature vector of the document to be processed and the document feature vector of a preset template is used as the template-related feature; the part-of-speech-related feature is determined according to the ratio of the number of words of a preset part of speech in the document to be processed to the number of all words in the document to be processed;
  • At least two of length-related features, template-related features, and part-of-speech-related features are used as the target features.
  • the processing module configured to extract the target feature from the document to be processed, includes:
  • the word count of the document to be processed is discretized according to a plurality of predetermined word count intervals to obtain the first feature, where each word count interval corresponds to a value;
  • the average sentence length of the document to be processed is discretized according to a plurality of predetermined sentence length intervals to obtain the second feature, where each sentence length interval corresponds to a value;
  • the number of errors in the document to be processed is used as the independent variable of an exponential function, and the value of the exponential function is used as the third feature;
  • the number of advanced words in the document to be processed is discretized according to a plurality of predetermined advanced word count intervals to obtain the fourth feature, where each advanced word count interval corresponds to a value, and an advanced word is a word located in a predetermined advanced vocabulary;
  • At least two of the first feature, the second feature, the third feature, and the fourth feature are used as the target feature.
  • An embodiment of the present application provides an electronic device, and the electronic device includes:
  • a memory configured to store executable instructions
  • the processor is configured to implement any one of the above document processing methods when executing the executable instructions stored in the memory.
  • Embodiments of the present application provide a computer-readable storage medium storing executable instructions for implementing any one of the foregoing document processing methods when executed by a processor.
  • An embodiment of the present application provides a computer program, including computer-readable code, when the computer-readable code is executed in an electronic device, the processor in the electronic device executes any one of the above document processing methods.
  • a document to be processed is obtained; a configuration file sent by a third-party platform is received, where the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform; the file The package includes first information that characterizes the feature extraction method of the target feature; when the identifier of the target feature is different from the identifier of the default feature, the file package is acquired based on the path information of the file package; based on the The first information in the file package extracts the target feature from the document to be processed.
  • FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application.
  • Fig. 2 is an optional flowchart of the document processing method provided by the embodiment of the present application.
  • Fig. 3 is a flow chart of realizing the encrypted transmission of information in the configuration file in the embodiment of the present application.
  • Fig. 4 is another optional flowchart of the document processing method provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of an optional composition structure of a document processing apparatus according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an optional composition structure of an electronic device provided by an embodiment of the present application.
  • document management can be achieved by manually evaluating document quality. This greatly increases labor costs, and each person's document evaluation criteria cannot be retained as reusable experience.
  • manual evaluation of document quality is also strongly subjective and not sufficiently objective.
  • in the related art, the features of a certain type of document can also be extracted based on feature engineering, and the document quality can then be evaluated based on the extracted features. For example, a certain type of document may be an English composition, a Chinese composition, etc.; different types of documents may require different types of features to be extracted. Therefore, to extract different types of features, different feature extraction models or different feature libraries need to be developed and deployed; for each feature extraction model, new program code needs to be written and deployed locally, which increases time and labor costs.
  • the embodiments of the present application provide a document processing method, apparatus, device, and computer-readable storage medium; the document processing methods provided by the embodiments of the present application can be applied to electronic devices, and exemplary electronic devices provided by the embodiments of the present application are described below.
  • the electronic device provided by the embodiments of the present application can be implemented as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), etc.
  • FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application.
  • the electronic device 100 may connect to the third-party platform 102 through the network 101;
  • the network 101 may be a wide area network or a local area network, or a combination of the two;
  • the third-party platform 102 can be implemented based on a terminal and/or a server; the terminal can be a tablet computer, a notebook computer, a desktop computer, etc., but is not limited to these;
  • the server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
  • the third-party platform 102 may acquire the document to be processed and send the document to be processed to the electronic device 100;
  • the type of the document to be processed may be any type; in some embodiments, the document to be processed may be a Chinese document, an English document, or a document in another language; in some embodiments, the document to be processed may be a plan document, log data of an electronic device, or another document; it should be noted that the above only gives examples of the types of the document to be processed, and the embodiments of the present application are not limited thereto.
  • the electronic device 100 may obtain the document to be processed locally, or download the document to be processed from the network 101 ; the electronic device 100 may send the document to be processed to the third-party platform 102 .
  • the third-party platform 102 can determine the target feature of the document to be processed and the feature extraction method of the target feature, and generate a configuration file; the configuration file includes at least the identifier of the target feature of the document to be processed and the path information of the file package provided by the third-party platform 102.
  • the first information may be a program code implementing a feature extraction method of the target feature.
  • the third-party platform 102 determines the target feature according to the actual feature extraction requirement.
  • the target feature may be one feature or may include multiple features.
  • the identifier of the target feature may be a name, a serial number, or other identifiers.
  • the file package may include a code collection that provides at least one function in an object-oriented programming language; for example, the object-oriented programming language may be the Java language, the C++ language, etc.; when the object-oriented programming language is the Java language, the file package can be a jar package.
  • the third-party platform 102 may send the configuration file and file package to the electronic device 100 .
  • the path information of the file package may indicate the storage location of the file package in the electronic device 100; the electronic device 100 may determine the storage location of the file package according to the configuration file, extract the first information from the file package, and extract the target feature from the document to be processed according to the first information.
  • the document processing method according to the embodiment of the present application is exemplarily described below with reference to the application scenario shown in FIG. 1 .
  • FIG. 2 is an optional flowchart of a document processing method provided by an embodiment of the present application. As shown in FIG. 2 , the flowchart may include:
  • Step 201 Obtain the document to be processed.
  • Step 202 Receive the configuration file sent by the third-party platform.
  • Step 203 In the case that the identifier of the target feature is different from the identifier of the default feature, acquire the file package based on the path information of the file package.
  • the default feature is a feature predetermined by the electronic device, and for the default feature, the extraction method of the default feature is also predetermined.
  • the file package can be read based on the path information of the file package in the configuration file.
  • Step 204 Extract target features from the document to be processed based on the first information in the file package.
  • the first information represents the feature extraction method of the target feature. Therefore, based on the first information, the feature extraction method of the target feature can be determined, and further, the target feature can be extracted from the document to be processed.
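The branch taken in steps 203 and 204 can be sketched as follows. The source does not specify the configuration file format, so a Java properties file with the hypothetical keys `feature.id` and `package.path` is assumed, and the set of default feature identifiers is made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;
import java.util.Set;

final class ExtractionDispatcher {
    // Hypothetical identifiers of the features the device can extract on its own.
    static final Set<String> DEFAULT_FEATURES = Set.of("length", "template", "partOfSpeech");

    /**
     * Returns the path of the file package to load, or an empty Optional when the
     * target feature is a default feature and the predetermined method should be used.
     */
    static Optional<String> packagePathToLoad(String configText) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String featureId = config.getProperty("feature.id");
        if (featureId == null) {
            throw new IllegalArgumentException("configuration file has no feature.id");
        }
        if (DEFAULT_FEATURES.contains(featureId)) {
            return Optional.empty();
        }
        String packagePath = config.getProperty("package.path");
        if (packagePath == null) {
            throw new IllegalArgumentException("non-default feature without package.path");
        }
        return Optional.of(packagePath);
    }
}
```

With this split, adding a new feature only requires the third-party platform to change the configuration file and ship a file package; the dispatcher itself stays unchanged.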
  • the feature extraction method of the target feature is implemented based on a natural language processing (Natural Language Processing, NLP) method or other document processing methods.
  • the feature extraction method of the target feature may include a first method and a second method, wherein the first method may be denoted as a doCalculator method, and the second method may be denoted as a featureCalculate method.
  • processing the document to be processed based on the first method may include: 1) using the NLP method to segment the document to be processed into words and then collecting word-granularity statistics; 2) using the NLP method to split the document to be processed into sentences and then collecting sentence-granularity statistics; 3) removing high-frequency words and modal particles to perform denoising; 4) extracting data such as the main title, subtitles, and font size from the document to be processed, for example by using Apache POI, the Java application programming interface for Microsoft documents.
  • different language packages may be used to process the to-be-processed document according to different languages of the to-be-processed document.
  • for example, when the document to be processed is a Chinese document, the Han Language Processing (HanLP) package can be used to segment the document to be processed into words or sentences; when the document to be processed is an English document, an English language processing package can be used to segment the document to be processed into words or sentences.
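As a rough illustration of word- and sentence-granularity statistics, the sketch below uses the JDK's `java.text.BreakIterator` as a stand-in for a dedicated language processing package. This is a substitution for demonstration only: it is not HanLP, and it is not suited to Chinese word segmentation. The class name `DocumentStats` is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class DocumentStats {
    /** Counts the segments that start with a letter or digit, skipping spaces and punctuation. */
    static int countWords(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        int count = 0;
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                count++;
            }
        }
        return count;
    }

    /** Counts the non-blank sentences found by the sentence iterator. */
    static int countSentences(String text, Locale locale) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        sentences.setText(text);
        int count = 0;
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            if (!text.substring(start, end).trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** Average sentence length in words; 0 for a document without sentences. */
    static double averageSentenceLength(String text, Locale locale) {
        int sentences = countSentences(text, locale);
        return sentences == 0 ? 0.0 : (double) countWords(text, locale) / sentences;
    }
}
```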
  • after the document to be processed is processed based on the first method, a preliminary processing result of the document to be processed can be obtained, and the preliminary processing result includes the value of the feature; the preliminary processing result can then be further processed based on the second method; for example, based on the second method, discrete feature values may be normalized, and continuous feature values may be averaged.
  • when the first information is a program code for implementing a feature extraction method for a target feature, the program code may be executed to obtain the target feature.
  • steps 201 to 204 may be implemented based on a processor of an electronic device, and the processor may be at least one of an Application-Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor.
  • the third-party platform can modify, add, or delete the identifier of the target feature in the configuration file and modify the content of the file package, so that the electronic device can extract the target feature directly based on the received configuration file and file package, without new program code having to be written and deployed locally.
  • the above-mentioned file package includes a custom class, and the above-mentioned first information is located in the custom class.
  • the class in the file package represents the collective name or collection of some objects with the same attributes and behaviors in the object-oriented programming language.
  • the object is the abstraction of objective things, and the class is the abstraction of the object, which is an abstract data type; After customizing the class, the third-party platform can set the first information in the custom class.
  • the self-defined class in the file package can also be loaded through the reflection mechanism of the programming language, and the first information can be obtained from the loaded self-defined class.
  • the reflection mechanism of the programming language refers to the ability of a program to access, inspect, and modify its own state or behavior; in an example, the reflection mechanism of the Java language means that, while a program is running, an object of any class can be constructed, the class to which any object belongs can be determined, the member variables and methods of any class can be obtained, and the properties and methods of any object can be invoked. This function of dynamically obtaining program information and dynamically calling objects is called the reflection mechanism of the Java language.
  • the import method is usually used to load the classes in a file package. With the import method, the classes to be imported from the file package must be specified, so the classes of the file package need to be known in advance; if the classes of the file package are unknown, they cannot be loaded by the import method, and classes cannot be loaded dynamically from a file package received in real time.
  • the self-defined class in the file package can be loaded through the reflection mechanism of the programming language, that is, regardless of whether the custom class in the file package is known or unknown, it can be based on the program language.
  • the principle of reflection mechanism does not require the introduction of custom classes in the file package in advance, and the loading of custom classes in the file package can be realized; in the case of receiving the file package in real time, the dynamic loading of the custom class in the file package can be realized .
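  • As a minimal sketch of loading a class whose name is only known at run time, the following uses Python's `importlib` as an analogue of the JAVA reflection mechanism described above; it is not the implementation of this application. The function name `load_custom_class`, the package file `my_package.py`, and the class name `MyAlgorithm` are illustrative assumptions.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_custom_class(package_path: str, class_name: str):
    """Load class_name from the file at package_path without a static import."""
    spec = importlib.util.spec_from_file_location(Path(package_path).stem, package_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {package_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the code of the received package
    return getattr(module, class_name)  # look the class up by its name


# Hypothetical file package, written to a temporary directory for the demo.
PACKAGE_SOURCE = (
    "class MyAlgorithm:\n"
    "    def extract(self, text):\n"
    "        return len(text.split())\n"
)

with tempfile.TemporaryDirectory() as tmp:
    package_file = Path(tmp) / "my_package.py"
    package_file.write_text(PACKAGE_SOURCE)
    custom_class = load_custom_class(str(package_file), "MyAlgorithm")
    word_count = custom_class().extract("one two three")
```

  • The caller never names `MyAlgorithm` in an import statement; the class name arrives as data, which is the property the description attributes to reflection-based loading.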
  • the electronic device may agree with the third-party platform in advance on the identifier of the file package and/or the identifier of the custom class in the file package.
  • the identifier of the file package may be the name of the file package or another identifier.
  • the identifier of the custom class in the file package may be the name of the custom class, the number of the custom class, or another identifier.
  • when the third-party platform suffers a malicious attack, or does not generate the file package according to the agreed requirements, the identifier of the file package sent by the third-party platform may differ from the agreed identifier of the file package, and/or the identifier of the custom class in the file package may differ from the agreed identifier of the custom class.
  • such a difference causes the file package provided by the third-party platform to fail to meet the actual requirements.
  • the above configuration file may further include second information, where the second information includes: an identifier of a file package provided by a third-party platform and/or an identifier of the above-mentioned custom class.
  • one way to load the custom class in the file package is to load it only after determining that the second information in the configuration file is the information agreed in advance with the third-party platform.
  • otherwise, the received file package may be ignored.
  • if the second information in the configuration file is the information agreed in advance with the third-party platform, the second information is correct; loading the custom class on this basis helps to obtain the first information accurately from the custom class and, in turn, to extract the target feature accurately.
  • the file package provided by the third-party platform is not authenticated; therefore, if a malicious attacker such as a hacker learns information such as the name of the custom class in the file package, the attacker can attack the electronic device by imitating the file package.
  • the electronic device may also obtain a preset encryption method of the second information; correspondingly, after receiving the configuration file sent by the third-party platform, the electronic device decrypts the encrypted information in the configuration file using the decryption method corresponding to the encryption method, and obtains the second information; the encrypted information is obtained by encrypting the second information with the above-mentioned encryption method.
  • the electronic device may obtain the preset encryption method of the second information before receiving the configuration file sent by the third-party platform; for example, the preset encryption method may be one agreed on between the electronic device and the third-party platform.
  • after the third-party platform and the electronic device agree on the encryption method of the second information, the third-party platform, having generated the second information, can encrypt it with the agreed encryption method to obtain the encrypted information, and can then send the configuration file including the encrypted information to the electronic device.
  • the above encryption method and decryption method may be set according to actual conditions.
  • the encryption method and the decryption method may be determined based on a symmetric encryption method such as the Data Encryption Standard (DES), or based on an asymmetric encryption method.
  • in the example of FIG. 3 below, the encryption method and the decryption method are determined based on an asymmetric encryption method that uses a public key and a private key.
  • FIG. 3 is a flowchart of implementing encrypted transmission of information in a configuration file in an embodiment of the application.
  • as shown in FIG. 3, the process of implementing encrypted transmission of information in the configuration file may include:
  • Step 301 The electronic device sends the public key and the private key to the third-party platform.
  • the electronic device may agree with the third-party platform on the above-mentioned second information; the electronic device may store the public key, the private key and the agreed second information in a database, so as to facilitate subsequent verification;
  • Step 302 The third-party platform encrypts the second information by using the private key.
  • after receiving the private key, the third-party platform does not need to encrypt the file package or the classes in the file package directly; instead, after writing the second information into the configuration file, it uses the private key to encrypt the second information.
  • Step 303 The third-party platform writes the public key corresponding to the private key into the configuration file, and sends the configuration file to the electronic device.
  • after encrypting the second information of the configuration file with the private key and writing the public key corresponding to the private key into the configuration file, the third-party platform can send the configuration file to the electronic device.
  • the configuration file further includes the identifier of the feature extraction method of the target feature.
  • the third-party platform can also use the private key to encrypt the identifier of the feature extraction method of the target feature; the identifier of the feature extraction method may be information such as a name.
  • Step 304 The electronic device searches for the private key corresponding to the public key.
  • after receiving the configuration file, the electronic device can read the path information and the public key in the configuration file, and search the database for the private key corresponding to the public key.
  • Step 305 The electronic device decrypts the encrypted information in the configuration file by using the private key.
  • both the above steps 304 and 305 may be implemented by a program running in an electronic device.
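  • if the second information obtained by decryption is consistent with the agreed second information stored in the database, the file package is determined to be a correct data package.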
  • by agreeing on the encryption method of the second information in the configuration file, the embodiment of the present application lets the third-party platform encrypt the second information; after receiving the configuration file sent by the third-party platform, the electronic device decrypts it using the decryption method corresponding to the agreed encryption method. The second information is thus transmitted in encrypted form, which improves its security and reduces the risk of it being attacked.
  • the electronic device may predetermine an abstract class and require the custom class to inherit it; for example, the electronic device may agree with the third-party platform that the custom class inherits the predetermined abstract class.
  • an abstract class is a class that cannot be instantiated as an object; inheritance is a concept in object-oriented software technology that gives a subclass the properties and methods of its parent class, so that the subclass has the same behavior as the parent class.
  • the electronic device can agree, through interaction with the third-party platform, that the custom class in the file package inherits the abstract class; understandably, even with this agreement in place, when the third-party platform suffers a malicious attack or does not inherit the abstract class as agreed, the classes in the file package provided by the third-party platform do not actually inherit the abstract class.
  • in that case, the custom class in the file package provided by the third-party platform may prevent the electronic device from obtaining the first information from the custom class.
  • one way to obtain the first information from the custom class is to instantiate the custom class as an object and, if the object belongs to the abstract class, obtain the first information from the loaded custom class.
  • after determining that the received file package is a correct data package, the electronic device needs to determine whether the class in the file package inherits the above-mentioned predetermined abstract class.
  • in an example, the custom class loader URLClassLoader supports the JAVA reflection function by setting the setAccessible parameter; in this way, URLClassLoader can be used to load the custom class in the file package and to instantiate the loaded custom class as an object.
  • when the object instantiated from the custom class belongs to the abstract class, the custom class can be considered the correct class; on this basis, the first information can be obtained accurately from the custom class, which in turn facilitates accurate extraction of the target features.
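  • The check described above (instantiate the custom class, then confirm that the object belongs to the agreed abstract class) can be sketched as follows. This is a Python analogue under stated assumptions, not the JAVA implementation of this application: `FeatureExtractor`, `AverageSentenceLength`, `Unrelated`, and `get_first_information` are hypothetical names, and the "first information" is modelled simply as the object's `extract` method.

```python
from abc import ABC, abstractmethod


class FeatureExtractor(ABC):
    """Hypothetical abstract class agreed on in advance."""

    @abstractmethod
    def extract(self, document: str) -> float:
        ...


class AverageSentenceLength(FeatureExtractor):
    """Custom class that inherits the agreed abstract class."""

    def extract(self, document: str) -> float:
        sentences = [s for s in document.split(".") if s.strip()]
        if not sentences:
            return 0.0
        return sum(len(s.split()) for s in sentences) / len(sentences)


class Unrelated:
    """Custom class that does not inherit the agreed abstract class."""

    def extract(self, document: str) -> float:
        return 1.0


def get_first_information(custom_class):
    """Instantiate the class; hand back its extraction method only if the
    resulting object belongs to the agreed abstract class."""
    instance = custom_class()
    if isinstance(instance, FeatureExtractor):
        return instance.extract
    return None


extractor = get_first_information(AverageSentenceLength)
rejected = get_first_information(Unrelated)
```

  • A class that merely happens to define an `extract` method is still rejected, because the check is on inheritance rather than on method names.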
  • when the identifier of the target feature is the same as the identifier of the default feature, the target feature is the default feature; in this case, the target feature is extracted from the document to be processed based on the predetermined extraction method of the default feature.
  • in this case, the embodiment of the present application does not need to obtain the extraction method of the target feature from the third-party platform; the target feature can be extracted with the predetermined extraction method of the default feature, which is easy to implement.
  • every feature in the target feature may be a default feature, no feature in the target feature may be a default feature, or some of the target features may be default features while the others are not; in each case, the embodiments of the present application provide a corresponding feature extraction method.
  • the third-party platform can write the identifiers of the non-default features into the configuration file and send the configuration file and the corresponding file package to the electronic device; the electronic device can then extract the new non-default features based on the file package. That is, the third-party platform can determine the content of the configuration file and of the file package according to the extraction requirements for the target feature of the document to be processed; when the target feature to be extracted changes, only the identifier of the target feature in the configuration file and the content of the file package need to be changed.
  • most of the target features to be extracted may be default features, while different types of documents may require new non-default features; in this case, the third-party platform can send different jar packages to the electronic device for different types of documents and set the content of the configuration file accordingly. The electronic device can then use the feature extraction method provided by the third-party platform directly to extract the non-default features from the respective jar package. Compared with the solution in the related art, which requires new program code to be written and run locally on the electronic device, this saves labor cost and time cost.
  • FIG. 4 is a flowchart of another optional document processing method of the embodiment of the present application.
  • as shown in FIG. 4, the main thread of the electronic device can be denoted as the thread epicDocCalculate, and the document processing method implemented based on the main thread of the electronic device can include:
  • Step 401 Read the configuration file and the file package.
  • the main thread of the electronic device can read the configuration file and the file package sent by the third-party platform.
  • Step 402 Determine whether the identifier of the target feature is the same as the identifier of the default feature. When the determination result is yes, step 403 is performed; when the determination result is no, step 404 is performed.
  • the main thread of the electronic device may determine, based on the configuration file, whether each target feature identifier of the document to be processed is the same as the default feature identifier.
  • Step 403 Extract default features.
  • the default feature extraction may be implemented based on a predetermined extraction manner of the default feature.
  • Step 404 Determine whether the file package and the class in the file package are correct, if both the file package and the class in the file package are correct, go to step 405; if the file package or the class in the file package is incorrect, return to step 401.
  • Step 405 Extract target features from the document to be processed based on the first information in the file package.
  • the target feature extraction can be achieved based on steps 401 to 405 .
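  • The branching of steps 402 to 405 can be sketched as below, assuming the configuration file has already been read (step 401) into a dictionary. This is a simplified Python illustration, not the thread epicDocCalculate itself; the function names, the `target_features` key, and the demonstration extractors are all hypothetical.

```python
def process_document(document, config, default_extractors, load_external_extractors):
    """One pass over steps 402-405; returns None when the check in step 404
    fails, so that the caller can go back to step 401."""
    features = {}
    external = None
    for feature_id in config["target_features"]:
        if feature_id in default_extractors:  # step 402 -> step 403
            features[feature_id] = default_extractors[feature_id](document)
            continue
        if external is None:  # step 404: verify and load the file package once
            external = load_external_extractors(config)
            if external is None:
                return None
        features[feature_id] = external[feature_id](document)  # step 405
    return features


# Hypothetical demonstration data.
DEFAULTS = {"word_count": lambda doc: float(len(doc.split()))}
CONFIG = {"target_features": ["word_count", "sentence_count"]}


def load_ok(config):
    """Stands in for a file package that passed verification."""
    return {"sentence_count": lambda doc: float(doc.count("."))}


def load_failed(config):
    """Stands in for a file package or class that failed verification."""
    return None


features = process_document("One two. Three.", CONFIG, DEFAULTS, load_ok)
rejected = process_document("One two. Three.", CONFIG, DEFAULTS, load_failed)
```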
  • after acquiring the document to be processed, the electronic device may also skip receiving the configuration file sent by the third-party platform and instead extract the default features directly from the document to be processed, based on the predetermined extraction method of the default feature.
  • the document to be processed may also be scored based on the target feature to obtain a quality score value of the document to be processed, so as to realize the quality assessment of the document to be processed.
  • the target feature includes at least two features; the configuration file includes weight information for each of the at least two features.
  • one way to score the quality of the document to be processed based on the target feature, and so obtain its quality score value, is as follows:
  • based on the weight information, a weighted sum operation is performed on the at least two features to obtain the quality score value of the document to be processed.
  • the quality score value of the document to be processed can be calculated according to formula (1):
  • $$S = \sum_{i=1}^{n} w_i f_i \tag{1}$$
  • where $S$ represents the quality score value of the document to be processed, $f_i$ represents the i-th feature of the above at least two features, $w_i$ represents the weight of the i-th feature, and $n$ represents the number of features in the above at least two features.
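  • A minimal Python sketch of the weighted sum in formula (1) follows. The function name `quality_score` and the example values and weights are illustrative assumptions, not figures from this application.

```python
def quality_score(values, weights):
    """Weighted sum of feature values: S = sum of w_i * f_i over all features."""
    if len(values) != len(weights):
        raise ValueError("each feature needs exactly one weight")
    return sum(w * f for w, f in zip(weights, values))


# Hypothetical example: two features with values 1.0 and 0.5, weights 0.6 and 0.4.
score = quality_score([1.0, 0.5], [0.6, 0.4])
percent_score = score * 100  # the same score under the percentile system
```

  • With these assumed inputs the score is approximately 0.8, or about 80 under the percentile system.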
  • the third-party platform may determine the weight of the target feature according to actual requirements, or according to the initial weight of the target feature sent by the electronic device.
  • the electronic device may determine the initial weight of the target feature in advance and send it to the third-party platform; the third-party platform may use the initial weight directly as the weight of the corresponding feature, or may modify the initial weight to obtain the weight of the corresponding feature.
  • PublicKey represents the public key
  • ClassLocation represents the path of the jar package
  • ClassName represents the class name
  • featureName represents the default feature name, and featureWeight represents the default feature weight
  • ExternalFeatureName represents the non-default feature name
  • ExternalFeatureWeight represents the non-default feature weight
  • A, B, C and D represent feature A, feature B, feature C and feature D, respectively
  • feature A, feature B, feature C and feature D represent different default features
  • the weights of feature A, feature B, feature C and feature D are the initial weights determined by the electronic device
  • the weights of Feature A, Feature B, Feature C, and Feature D are 0.1, 0.2, 0.1, and 0.1, respectively.
  • D, E, and F represent feature D, feature E, and feature F.
  • Feature D, feature E, and feature F are all non-default features.
  • the weights of feature D, feature E, and feature F are 0.1, 0.2, and 0.2.
  • ClassName: myAlgorithm (the name of the custom class, encrypted)
  • featureName: [A1] (default feature extraction method)
  • featureWeight: [0.4] (initial weight)
  • Non-default feature name: [A2, A3, A4] (non-default features, encrypted)
  • Non-default feature weights: [0.2, 0.2, 0.2] (non-default feature weights)
  • when the default feature includes multiple features, multiple different candidate weight combinations may be predetermined for the default feature; each candidate weight combination includes a weight for each feature in the default feature, and the weights in each candidate weight combination sum to 1; one weight combination is selected from the multiple candidate weight combinations as the initial weight of the default feature.
  • one way to select a weight combination from the multiple candidate weight combinations is: obtain a manual score value for a pre-acquired sample document; for each candidate weight combination, perform a weighted sum operation on the feature values of the sample document to obtain the quality score value of the sample document; then select a candidate weight combination that satisfies the set condition, the set condition being that the absolute value of the difference between the manual score value and the quality score value of the sample document is less than the set value.
  • among the candidate weight combinations that satisfy the set condition, the one whose quality score value is closest to the manual score value of the sample document may be selected.
  • in an example, the default features include feature A5 and feature A6; the weight of feature A5 is traversed from 0.1 to 0.9 with a preset step of 0.05 to determine multiple weights of feature A5; for each weight of feature A5, the weight of feature A6 is determined, thereby obtaining each candidate weight combination; the weights in each candidate weight combination sum to 1.
  • Each candidate weight combination of feature A5 and feature A6 is shown in Table 3, and the same row of Table 3 represents one candidate weight combination.
  • the absolute value of the difference between the manual score value and the quality score value of the sample document can be determined for each candidate weight combination;
  • Table 4 shows the manual score value and quality score value of the sample document corresponding to each candidate weight combination.
  • one weight combination may be selected from the multiple candidate weight combinations as the initial weight of the default feature.
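  • The selection procedure for two features can be sketched in Python as follows. It assumes the manual score value and the quality score value are on the same scale; the function names, the sample feature values (0.4, 1.0), the manual score 0.7, and the set value 0.05 are hypothetical and do not come from Table 3 or Table 4.

```python
def candidate_weight_combinations(start=0.1, stop=0.9, step=0.05):
    """Return (first weight, second weight) pairs whose weights sum to 1."""
    count = round((stop - start) / step) + 1
    combos = []
    for k in range(count):
        w_first = round(start + k * step, 2)
        combos.append((w_first, round(1.0 - w_first, 2)))
    return combos


def select_initial_weights(feature_values, manual_score, set_value):
    """Among combinations with |manual - quality| < set_value, pick the one
    whose quality score is closest to the manual score; None if none qualifies."""
    best, best_gap = None, None
    for weights in candidate_weight_combinations():
        quality = sum(w * f for w, f in zip(weights, feature_values))
        gap = abs(manual_score - quality)
        if gap < set_value and (best_gap is None or gap < best_gap):
            best, best_gap = weights, gap
    return best


combos = candidate_weight_combinations()
initial = select_initial_weights((0.4, 1.0), manual_score=0.7, set_value=0.05)
no_match = select_initial_weights((0.4, 1.0), manual_score=0.0, set_value=0.05)
```

  • Stepping by an integer index and rounding each weight avoids the drift that repeated floating-point addition of 0.05 would introduce.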
  • the electronic device may also determine the initial weights of the default feature and the non-default feature at the same time, and send the initial weights of the default feature and the non-default feature to a third-party platform; the third-party platform can directly use the initial weight of the default feature and non-default feature as the weight of the corresponding feature, or it can modify the initial weight of the default feature and non-default feature to obtain the weight of the corresponding feature .
  • the default feature includes feature B1, and the non-default feature is feature B2; for the weight of feature B1, based on a preset step of 0.05, traversing from 0.1 to 0.9, multiple weights of feature B1 are determined; for the feature For each weight of B1, the weight of feature B2 is determined to obtain each candidate weight combination; each candidate weight combination includes the weight of feature B1 and the weight of feature B2, and the weight of feature B1 and the weight of feature B2 in each candidate weight combination The sum of the weights is equal to 1.
  • the absolute value of the difference between the manual score value and the quality score value of the sample document can be determined for each candidate weight combination; when the sample document is an English document, feature B1 represents the number of words, and feature B2 represents the average sentence length, Table 5 shows the manual score value and quality score value of the sample document corresponding to each candidate weight combination.
  • one weight combination can be selected from the multiple candidate weight combinations as the initial weight of the default feature and the non-default feature.
  • in an example, the document to be processed is a Chinese document, and the target features of the document to be processed include a length-related feature, a template-related feature, and a part-of-speech-related feature.
  • the length-related feature represents the number of words in the document to be processed, the template-related feature represents the difference between the document to be processed and the preset template, and the part-of-speech-related feature represents the ratio of the number of words of the preset parts of speech to the number of all words in the document to be processed.
  • the preset parts of speech include verbs and nouns.
  • a plurality of different character count intervals may be predetermined, each corresponding to a value; in this way, the value of the length-related feature can be obtained by discretizing the character count.
  • the value of the length-related feature can be determined according to Table 6.
  • Table 6 (excerpt): word count | value of the length-related feature
  • word count < 100 | 0
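  • Discretizing a count into interval values can be sketched as below. Only two points are taken from this description: a count below 100 maps to 0 (reading the garbled comparison in Table 6 as "<" is itself an assumption) and a count above 2000 maps to 1. The intermediate interval boundaries and values are hypothetical placeholders, since the rest of Table 6 is not available here.

```python
from bisect import bisect_right

# Hypothetical interval boundaries; only "< 100 -> 0" and "> 2000 -> 1" come from the text.
LOWER_BOUNDS = [100, 500, 1000, 1500]
INTERVAL_VALUES = [0.0, 0.2, 0.4, 0.6, 0.8]


def length_feature(count: int) -> float:
    """Map a word or character count to the value of its interval."""
    if count > 2000:
        return 1.0
    return INTERVAL_VALUES[bisect_right(LOWER_BOUNDS, count)]
```

  • With these assumed boundaries, a count of 700 falls in the third interval and maps to 0.4.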
  • Apache POI can be used to extract content attribute data from the document to be processed and from the preset template; the content attribute data can include at least one of the following: main title, subtitle, body text, summary, title No. 1, title No. 2, title No. 3, title No. 4, title No. 5, and so on; the content attribute data is then converted into a document feature vector.
  • the content attribute data of the preset template is: (title, title No. 1, body, summary), and the document feature vector of the preset template is [1, 1, 1, 1];
  • the document feature vector of the document to be processed is set to a vector of all zeros; when the content attribute data of the document to be processed contains title, No.
  • in the first example, the document to be processed is document 1, and the content attribute data of document 1 is (title, title No. 1, text, summary); by comparing the content attribute data of the preset template and of document 1, it can be determined that the document feature vector of document 1 is [1, 1, 1, 1].
  • in the second example, the document to be processed is document 2, and the content attribute data of document 2 is (title, title 3, title 4, title 5, text, summary); by comparing the content attribute data of the preset template and of document 2, the document feature vector of document 2 can be determined as [1, -1, -1, -1, 1, 1].
  • in the third example, the document to be processed is document 3, and the content attribute data of document 3 is (title 3, title 4, title 5); the content attribute data of document 3 is completely different from that of the preset template, because it does not include any of the title, title No. 1, text, or summary. Therefore, it can be determined that the document feature vector of document 3 is a vector of all zeros.
  • the similarity between the document to be processed and the preset template can be determined based on their document feature vectors; that is, the value of the template-related feature can be determined.
  • the similarity between the document to be processed and the preset template may be a cosine similarity, calculated according to formula (2):
  • $$\cos(\theta) = \frac{G \cdot H}{\|G\| \, \|H\|} \tag{2}$$
  • where $G$ and $H$ represent the document feature vectors of the document to be processed and the preset template, respectively, $\|G\|$ represents the length of the vector $G$, $\|H\|$ represents the length of the vector $H$, $G \cdot H$ represents the dot product of the vector $G$ and the vector $H$, and $\cos(\theta)$ represents the cosine similarity between the document to be processed and the preset template; that is, $\cos(\theta)$ is the value of the template-related feature.
  • the cosine similarity represents the cosine value of the angle between the two vectors.
  • when the cosine similarity is large, the vector G and the vector H are relatively similar; conversely, when the cosine similarity is small, the vector G and the vector H differ considerably.
  • when the document to be processed is the above-mentioned document 1, it can be determined according to formula (2) that the cosine similarity between the document to be processed and the preset template is 1; that is, the value of the template-related feature of the document to be processed is 1.
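  • Formula (2) can be sketched in Python for two document feature vectors of equal length. The function name `cosine_similarity` is an assumption, and so is the choice to return 0 when either vector is all zeros, a case in which the formula itself is undefined.

```python
import math


def cosine_similarity(g, h):
    """Cosine of the angle between two equal-length document feature vectors."""
    if len(g) != len(h):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(g, h))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_h = math.sqrt(sum(b * b for b in h))
    if norm_g == 0 or norm_h == 0:
        return 0.0  # assumption: an all-zero vector is treated as no similarity
    return dot / (norm_g * norm_h)


template_vector = [1, 1, 1, 1]
document_1_vector = [1, 1, 1, 1]
similarity = cosine_similarity(document_1_vector, template_vector)
```

  • For document 1 and the preset template, both [1, 1, 1, 1], the dot product is 4 and each norm is 2, so the similarity is 1, as stated above.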
  • the part-of-speech related features may be determined according to the proportion of nouns and verbs in the document to be processed in all words in the document to be processed; in some embodiments, the number of nouns and the number of verbs in the document to be processed is 20, The total number of words is 50, and the value of part-of-speech-related features is 0.6.
  • in an example, the word count of the document to be processed is greater than 2000, the document feature vector of the preset template is [1, 1, 1, 1], the document feature vector of the document to be processed is [1, 1, 1, 1], and the ratio of nouns and verbs to all words in the document to be processed is 0.6; it can be determined that the values of the length-related feature, the template-related feature, and the part-of-speech-related feature of the document to be processed are 1, 1, and 0.6, respectively.
  • the quality score value of the document to be processed can be calculated according to formula (1); here, the quality score value of the document to be processed is 0.84.
  • in some embodiments, the quality score value of the document to be processed can also be multiplied by 100 to obtain the quality score value under the percentile system; here, the quality score value of the document to be processed under the percentile system is 84.
  • in an example, the document to be processed is an English document, and the target features of the document to be processed include feature C1, feature C2, feature C3, and feature C4; feature C1 is a default feature indicating the number of words in the document to be processed, and feature C2, feature C3, and feature C4 are non-default features.
  • feature C2 represents the average sentence length of the document to be processed, feature C3 represents the number of document errors in the document to be processed, and feature C4 represents the number of advanced vocabulary words in the document to be processed.
  • document errors include but are not limited to word spelling errors, errors in the use of punctuation, and a sentence whose first word does not begin with a capital letter.
  • advanced vocabulary means vocabulary found in a predetermined advanced vocabulary table; in practical applications, users can determine the advanced vocabulary table in advance according to the content of the document to be processed.
  • a plurality of different word count intervals may be predetermined, each corresponding to a value.
  • the value of feature C1 can be obtained by discretizing the word count; for example, on the basis of Table 6, the character count can be replaced by the word count to obtain a plurality of word count intervals and the value corresponding to each interval.
  • the lengths of the sentences can be averaged to obtain the average sentence length; to determine the value corresponding to the average sentence length, a plurality of sentence length intervals can be predetermined, each corresponding to a value. In this way, the value of feature C2 can be obtained by discretizing the average sentence length.
  • the value corresponding to the average sentence length can be obtained according to Table 7.
  • the number of document errors can be used as the independent variable of the exponential function, and the value of the dependent variable of the exponential function can be used as the value of the feature C3; here, the base of the exponential function is greater than 0 and less than 1. It is understandable that when the number of document errors is larger, the value of feature C3 is smaller.
  • the exponential function can be the following formula (3):
  • $$Y = R^{X} \tag{3}$$
  • where $X$ represents the number of document errors, $Y$ represents the value of feature C3, and $R \in (0, 1)$; for example, the value of $R$ is 0.9.
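  • Formula (3) can be sketched in Python as follows; the function name `error_feature` is a hypothetical choice, and the default base 0.9 is the example value of R given above.

```python
def error_feature(error_count: int, base: float = 0.9) -> float:
    """Value of feature C3: base ** error_count, with 0 < base < 1."""
    if not 0.0 < base < 1.0:
        raise ValueError("base must lie strictly between 0 and 1")
    if error_count < 0:
        raise ValueError("error_count must not be negative")
    return base ** error_count


two_errors = error_feature(2)
five_errors = error_feature(5)
```

  • Because the base is below 1, each additional document error lowers the value of feature C3; a document with no errors scores 1.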
  • a plurality of intervals of the number of advanced words may be predetermined, and each interval of the number of advanced words corresponds to a value.
  • the value of feature C4 can be obtained; in an example, when the number of advanced vocabulary is greater than or equal to 20, the value of feature C4 is 1.
  • in an example, the number of words in the document to be processed is 700, the average sentence length is 20, the number of document errors is 2, the number of advanced vocabulary words is 20, the total number of sentences is 40, and the value of R is 0.9; the values of feature C1, feature C2, feature C3, and feature C4 of the document are then 0.4, 1, 0.81, and 1, respectively.
  • the weights of feature C1, feature C2, feature C3, and feature C4 are 0.4, 0.2, 0.2, and 0.2, respectively.
  • the quality score value of the document to be processed can be calculated according to formula (1); here, the quality score value of the document to be processed is 0.722. In some embodiments, the quality score value can also be multiplied by 100 to obtain the quality score value under the percentile system; here, the quality score value of the document to be processed under the percentile system is 72.2.
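  • For reference, the arithmetic behind these figures can be written out as a worked instance of the weighted sum and of the exponential function with R = 0.9 and two document errors; the step-by-step layout is added here for illustration only.

```latex
\begin{aligned}
Y &= R^{X} = 0.9^{2} = 0.81 \\
S &= 0.4 \times 0.4 + 0.2 \times 1 + 0.2 \times 0.81 + 0.2 \times 1 \\
  &= 0.16 + 0.2 + 0.162 + 0.2 = 0.722 \\
100 \times S &= 72.2
\end{aligned}
```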
  • the embodiments of the present application can be applied to any document management scenario.
  • in an example, the documents to be processed are pre-plan documents.
  • using the document processing method of the embodiments of the present application, first, communication between the electronic device and the third-party platform is established based on the network communication structure shown in FIG. 1; then, the third-party platform can send the configuration file and the file package to the electronic device, and the electronic device can extract the target feature according to the configuration file and the file package, using NLP and other technologies; finally, based on the extracted target feature, the quality of the plan document can be evaluated and audited, which helps to further optimize the plan document.
  • FIG. 5 is a schematic diagram of an optional composition structure of the document processing apparatus according to an embodiment of the present application.
  • the document processing apparatus 500 may include:
  • the first obtaining module 501 is configured to obtain a document to be processed;
  • the receiving module 502 is configured to receive a configuration file sent by a third-party platform, where the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform; the file package includes first information representing the feature extraction method of the target feature;
  • the second obtaining module 503 is configured to obtain the file package based on the path information of the file package when the identifier of the target feature is different from the identifier of the default feature;
  • the processing module 504 is configured to extract the target feature from the document to be processed based on the first information in the file package.
  • the file package includes a custom class, and the first information is located in the custom class;
  • the second obtaining module 503 is further configured to load the custom class in the file package through the reflection mechanism of the programming language, and obtain the first information from the loaded custom class.
  • the configuration file further includes second information, where the second information includes: an identifier of the file package and/or an identifier of the custom class;
  • the second obtaining module 503 is configured to load the custom class in the file package through the reflection mechanism of the programming language, including:
  • when it is determined that the second information in the configuration file is information agreed in advance with the third-party platform, the custom class in the file package is loaded through the reflection mechanism of the programming language.
  • the second obtaining module 503 is further configured to obtain a preset encryption method of the second information, and to decrypt the encrypted information in the configuration file based on the decryption method corresponding to that encryption method, to obtain the second information; the encrypted information is obtained by encrypting the second information based on the encryption method.
  • the second obtaining module 503 is further configured to predetermine an abstract class, and set the custom class to inherit the predetermined abstract class;
  • the second obtaining module 503 is configured to obtain the first information from the loaded custom class, including:
  • the custom class is instantiated as an object, and when the object belongs to the abstract class, the first information is obtained from the loaded custom class.
  • the processing module 504 is further configured to, when the identifier of the target feature is the same as the identifier of the default feature, extract the target feature from the document to be processed based on the predetermined extraction method of the default feature.
  • the processing module 504 is further configured to perform a quality score on the document to be processed based on the target feature, and obtain a quality score value of the document to be processed.
  • the target feature includes at least two features; the configuration file includes weight information of each of the at least two features;
  • the processing module 504 is configured to perform a quality score on the document to be processed based on the target feature, and obtain a quality score value of the document to be processed, including:
  • based on the weight information of each of the at least two features, a weighted sum operation is performed on the at least two features to obtain the quality score value of the document to be processed.
  • the processing module 504 is configured to extract the target feature from the document to be processed, including:
  • the word count of the document to be processed is discretized according to a plurality of predetermined word count intervals to obtain a length-related feature, where each word count interval corresponds to a value; a document feature vector of the document to be processed is extracted, and the cosine similarity between that vector and the document feature vector of a preset template is used as a template-related feature; a part-of-speech-related feature is determined according to the ratio of the number of words of a preset part of speech in the document to be processed to the number of all words in the document;
  • at least two of the length-related feature, the template-related feature and the part-of-speech-related feature are used as the target feature.
  • the processing module 504 is configured to extract the target feature from the document to be processed, including:
  • the number of words of the document to be processed is discretized according to a plurality of predetermined word count intervals to obtain a first feature, where each word count interval corresponds to a value;
  • the average sentence length of the document to be processed is discretized according to a plurality of predetermined sentence length intervals to obtain a second feature, where each sentence length interval corresponds to a value;
  • the number of document errors of the document to be processed is used as the independent variable of an exponential function, and the value of the exponential function is used as a third feature; the number of advanced words of the document to be processed is discretized according to a plurality of predetermined intervals of the number of advanced words to obtain a fourth feature, where each interval corresponds to a value and an advanced word is a word found in a predetermined advanced vocabulary list;
  • at least two of the first feature, the second feature, the third feature and the fourth feature are used as the target feature.
  • the first obtaining module 501, the receiving module 502, the second obtaining module 503 and the processing module 504 can all be implemented by a processor, and the processor can be at least one of an ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller and microprocessor. It can be understood that other electronic components may also implement the function of the above processor, which is not limited in the embodiments of the present application.
  • if the above document processing method is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a U disk, a mobile hard disk, a read only memory (Read Only Memory, ROM), a magnetic disk or an optical disk and other media that can store program codes.
  • the embodiments of the present application are not limited to any specific combination of hardware and software.
  • the embodiments of the present application further provide a computer program product, where the computer program product includes computer-executable instructions, and the computer-executable instructions are used to implement any one of the document processing methods provided by the embodiments of the present application.
  • an embodiment of the present application further provides a computer storage medium, where computer-executable instructions are stored on the computer storage medium, and the computer-executable instructions are used to implement any one of the document processing methods provided in the foregoing embodiments.
  • FIG. 6 is an optional structural schematic diagram of the electronic device provided by the embodiment of the present application.
  • the electronic device 60 includes:
  • a memory 601, configured to store executable instructions;
  • a processor 602, configured to implement any one of the above document processing methods when executing the executable instructions stored in the memory 601.
  • the above-mentioned processor 602 may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor.
  • the above computer-readable storage medium/memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); it may also be any of various terminals that include one or any combination of the above memories, such as a mobile phone, a computer, a tablet device or a personal digital assistant.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • the functional units in each embodiment of the present application may all be integrated into one processing unit, or each unit may serve separately as one unit, or two or more units may be integrated into one unit; the integrated unit can be implemented either in the form of hardware or in the form of hardware plus software functional units.
  • if the above integrated units of the present application are implemented in the form of software functional modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the related technologies, may be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions to cause the automatic test line of the device to perform all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
  • Embodiments of the present application provide a document processing method, apparatus, device, and computer-readable storage medium; the method includes: obtaining a document to be processed; receiving a configuration file sent by a third-party platform, where the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform, and the file package includes first information representing the feature extraction method of the target feature; when the identifier of the target feature is different from the identifier of the default feature, obtaining the file package based on the path information of the file package; and extracting the target feature from the document to be processed based on the first information in the file package.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A document processing method and apparatus, a device, and a computer-readable storage medium. The method comprises: obtaining a document to be processed (201); receiving a configuration file sent by a third-party platform (202), the configuration file comprising an identifier of a target feature of said document and path information of a file package provided by the third-party platform, and the file package comprising first information representing a feature extraction method of the target feature; under the condition that the identifier of the target feature is different from an identifier of a default feature, obtaining the file package on the basis of the path information of the file package (203); and extracting the target feature from said document on the basis of the first information in the file package (204).

Description

Document processing method, apparatus, electronic device, storage medium and program
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 202010884957.2, filed on August 28, 2020, the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of document management in financial technology (Fintech), and relates to, but is not limited to, a document processing method, an apparatus, an electronic device, a computer-readable storage medium and a computer program.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech). However, the security and real-time requirements of the financial industry also place higher demands on these technologies.
At present, in the field of financial technology, features of a document need to be extracted to facilitate document management, and document management is then performed based on those features. However, when a feature of the document is not a default feature but a new feature, new program code must be written and run to extract it, which increases time and labor costs.
Summary of the Invention
Embodiments of the present application provide a document processing method, an apparatus, an electronic device, a computer-readable storage medium and a computer program.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a document processing method, the method including:
obtaining a document to be processed;
receiving a configuration file sent by a third-party platform, where the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform, and the file package includes first information representing a feature extraction method of the target feature;
when the identifier of the target feature is different from the identifier of a default feature, obtaining the file package based on the path information of the file package;
extracting the target feature from the document to be processed based on the first information in the file package.
In some embodiments of the present application, the file package includes a custom class, and the first information is located in the custom class;
the method further includes: loading the custom class in the file package through a reflection mechanism of a programming language, and obtaining the first information from the loaded custom class.
It can be seen that, in the embodiments of the present application, the custom class in the file package can be loaded through the reflection mechanism of the programming language. That is, regardless of whether the custom class in the file package is known or unknown, it can be loaded based on the principle of the reflection mechanism without being imported in advance; when the file package is received in real time, the custom class in the file package can be loaded dynamically.
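As a rough illustration of loading a class that was not imported in advance, the sketch below uses Python's `importlib` as an analogue of the reflection mechanism; the source does not name a programming language. The module name `third_party_package`, the class `ParagraphCountFeature` and the helper `load_custom_class` are hypothetical, and the temporary file merely stands in for a file package obtained from the path in the configuration file.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_custom_class(package_path: str, class_name: str):
    """Load a class by its string name from a source file that was never imported."""
    spec = importlib.util.spec_from_file_location("third_party_package", package_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)      # runs the file's top-level code
    return getattr(module, class_name)   # look the class up by name at run time


# Hypothetical file package content: one module holding a custom extractor class.
PACKAGE_SOURCE = '''
class ParagraphCountFeature:
    def extract(self, document: str) -> int:
        return len([p for p in document.split("\\n\\n") if p.strip()])
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "custom_feature.py"
    path.write_text(PACKAGE_SOURCE, encoding="utf-8")
    extractor_cls = load_custom_class(str(path), "ParagraphCountFeature")
    paragraph_count = extractor_cls().extract("first\n\nsecond\n\nthird")
```

Because loading executes code supplied by another party, a check of the kind described for the second information would normally run before this step.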
In some embodiments of the present application, the configuration file further includes second information, where the second information includes an identifier of the file package and/or an identifier of the custom class;
the loading of the custom class in the file package through the reflection mechanism of the programming language includes:
when it is determined that the second information in the configuration file is information agreed in advance with the third-party platform, loading the custom class in the file package through the reflection mechanism of the programming language.
It can be seen that, when the second information in the configuration file is information agreed in advance with the third-party platform, the second information is correct. Loading the custom class on this basis helps to obtain the first information accurately from the custom class and, in turn, to extract the target feature accurately.
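One possible shape for this check is sketched below. The key names `second_info`, `package_id` and `class_id`, and the agreed values themselves, are assumptions made for illustration; the source only says that the second information holds the identifier of the file package and/or of the custom class.

```python
# Hypothetical identifiers agreed in advance with the third-party platform.
AGREED_SECOND_INFO = {
    "package_id": "feature-pack-01",
    "class_id": "ParagraphCountFeature",
}


def second_info_is_agreed(config: dict, agreed: dict = AGREED_SECOND_INFO) -> bool:
    """Return True only if every identifier carried by the config matches the agreed value."""
    second_info = config.get("second_info")
    if not second_info:  # missing or empty: nothing to trust
        return False
    return all(agreed.get(key) == value for key, value in second_info.items())
```

Since the second information may carry either identifier or both, the sketch accepts a configuration with only one of them, as long as whatever is present matches.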
In some embodiments of the present application, the method further includes:
obtaining a preset encryption method of the second information;
decrypting the encrypted information in the configuration file based on the decryption method corresponding to the encryption method of the second information, to obtain the second information, where the encrypted information is obtained by encrypting the second information based on the encryption method.
It can be seen that, after receiving the configuration file sent by the third-party platform, the embodiments of the present application can decrypt it based on the decryption method corresponding to the preset encryption method. The second information can therefore be transmitted in encrypted form, which improves its security and reduces the risk of it being attacked.
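The source does not say which encryption method is preset, so the sketch below only shows the control flow: look up the decryption routine for the preset mode, decrypt, then parse the result. The XOR transform is a toy stand-in and is not secure; a real deployment would use a vetted cipher. The mode name `toy-xor`, the key and the JSON layout are all hypothetical.

```python
import base64
import json
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy symmetric transform (NOT secure); placeholder for a real cipher."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Preset mode name -> decryption routine; "toy-xor" is a hypothetical mode name.
DECRYPTORS = {"toy-xor": xor_bytes}


def decrypt_second_info(encrypted_b64: str, mode: str, key: bytes) -> dict:
    """Decrypt the encrypted field of the configuration file and parse it."""
    decrypt = DECRYPTORS[mode]
    plaintext = decrypt(base64.b64decode(encrypted_b64), key)
    return json.loads(plaintext.decode("utf-8"))


# Round trip: the first two lines mimic what the third-party platform would send.
key = b"shared-secret"
original = {"package_id": "feature-pack-01", "class_id": "ParagraphCountFeature"}
encrypted = base64.b64encode(xor_bytes(json.dumps(original).encode("utf-8"), key)).decode("ascii")
recovered = decrypt_second_info(encrypted, "toy-xor", key)
```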
In some embodiments of the present application, the document processing method further includes:
predetermining an abstract class, and setting the custom class to inherit the predetermined abstract class;
the obtaining of the first information from the loaded custom class includes:
instantiating the custom class as an object, and, when the object belongs to the abstract class, obtaining the first information from the loaded custom class.
It can be seen that, in the embodiments of the present application, when the object instantiated from the custom class belongs to the abstract class, the custom class can be considered the correct class. On this basis, the first information can be obtained accurately from the custom class and, in turn, the target feature can be extracted accurately.
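A minimal sketch of this type check, using Python's `abc` module, might look as follows. The names `FeatureExtractor`, `SentenceCountFeature`, `Unrelated` and `first_info_from` are hypothetical, and treating the `extract` method as the "first information" is an assumption about how the extraction method is carried.

```python
from abc import ABC, abstractmethod


class FeatureExtractor(ABC):
    """Hypothetical abstract class fixed in advance on the electronic device."""

    @abstractmethod
    def extract(self, document: str) -> float:
        ...


class SentenceCountFeature(FeatureExtractor):
    """A custom class that inherits the predetermined abstract class."""

    def extract(self, document: str) -> float:
        return float(sum(document.count(mark) for mark in ".!?"))


class Unrelated:
    """Has a look-alike method but does not inherit the abstract class."""

    def extract(self, document: str) -> float:
        return -1.0


def first_info_from(custom_class):
    """Instantiate the loaded class and hand back its extraction method if the check passes."""
    obj = custom_class()
    if not isinstance(obj, FeatureExtractor):
        raise TypeError(f"{custom_class.__name__} does not inherit FeatureExtractor")
    return obj.extract
```

Checking the instance rather than merely looking for a method named `extract` rejects classes such as `Unrelated`, which matches the idea that only classes inheriting the agreed abstract class are treated as correct.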
In some embodiments of the present application, the method further includes:
when the identifier of the target feature is the same as the identifier of the default feature, extracting the target feature from the document to be processed based on the predetermined extraction method of the default feature.
It can be seen that, when the target feature is a default feature, the embodiments of the present application do not need to obtain the extraction method from the third-party platform; the target feature can be extracted based on the predetermined extraction method of the default feature, which is easy to implement.
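The branch between default and non-default features can be pictured as a simple lookup with a fallback. In the sketch below, the default feature `word_count`, the table `DEFAULT_EXTRACTORS` and the stand-in `fake_loader` are hypothetical; in the described method the fallback would fetch the file package from the path in the configuration file.

```python
from typing import Callable, Dict

Extractor = Callable[[str], float]


def word_count(document: str) -> float:
    return float(len(document.split()))


# Hypothetical table of default features and their predetermined extraction methods.
DEFAULT_EXTRACTORS: Dict[str, Extractor] = {"word_count": word_count}


def extract_target_feature(feature_id: str, document: str,
                           load_custom: Callable[[str], Extractor]) -> float:
    extractor = DEFAULT_EXTRACTORS.get(feature_id)
    if extractor is None:
        # Not a default feature: use the extractor shipped in the file package.
        extractor = load_custom(feature_id)
    return extractor(document)


def fake_loader(feature_id: str) -> Extractor:
    """Stand-in for fetching and loading the third-party file package."""
    return lambda document: float(document.count("!"))


default_value = extract_target_feature("word_count", "restart the service", fake_loader)
custom_value = extract_target_feature("exclamations", "stop! now!", fake_loader)
```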
In some embodiments of the present application, the method further includes:
performing quality scoring on the document to be processed based on the target feature, to obtain a quality score value of the document to be processed.
It can be seen that the embodiments of the present application can evaluate the quality of the document to be processed on the basis of the target feature, which in turn supports management of the document based on that evaluation.
In some embodiments of the present application, the target feature includes at least two features, and the configuration file includes weight information of each of the at least two features;
the performing of quality scoring on the document to be processed based on the target feature, to obtain the quality score value of the document to be processed, includes:
performing, based on the weight information of each of the at least two features, a weighted sum operation on the at least two features, to obtain the quality score value of the document to be processed.
It can be seen that the embodiments of the present application can evaluate the quality of the document to be processed by computing a weighted sum of the individual features of the target feature, which in turn supports management of the document based on that evaluation.
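The weighted sum can be checked against the example values given in the embodiment (feature values 0.4, 1, 0.81 and 1 for C1 to C4, with weights 0.4, 0.2, 0.2 and 0.2). The function name `quality_score` and the dictionary layout are illustrative choices, not taken from the source.

```python
def quality_score(values: dict, weights: dict) -> float:
    """Weighted sum of feature values; both dicts must name the same features."""
    if values.keys() != weights.keys():
        raise ValueError("values and weights must cover the same features")
    return sum(weights[name] * values[name] for name in values)


# Feature values and weights from the embodiment's worked example.
values = {"C1": 0.4, "C2": 1.0, "C3": 0.81, "C4": 1.0}
weights = {"C1": 0.4, "C2": 0.2, "C3": 0.2, "C4": 0.2}

score = quality_score(values, weights)   # approximately 0.722
percent = round(score * 100, 1)          # score on a percentage scale
```

Term by term this is 0.16 + 0.2 + 0.162 + 0.2, which agrees with the 0.722 (72.2 on the percentage scale) stated in the embodiment.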
In some embodiments of the present application, the extracting of the target feature from the document to be processed includes:
discretizing the word count of the document to be processed according to a plurality of predetermined word count intervals to obtain a length-related feature, where each word count interval corresponds to a value; extracting a document feature vector of the document to be processed, and using the cosine similarity between that vector and the document feature vector of a preset template as a template-related feature; determining a part-of-speech-related feature according to the ratio of the number of words of a preset part of speech in the document to be processed to the number of all words in the document;
using at least two of the length-related feature, the template-related feature and the part-of-speech-related feature as the target feature.
It can be seen that the embodiments of the present application can evaluate the quality of the document to be processed based on the length-related feature, the template-related feature and the part-of-speech-related feature, that is, the quality of the document can be assessed accurately from several aspects.
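The three features can be sketched with the standard library alone, under several assumptions: a bag-of-words count stands in for the document feature vector (the source does not say how the vector is built), the part-of-speech tags are supplied by hand rather than by a tagger, and the interval boundaries `[5, 50]` with values `[0.2, 0.6, 1.0]` are placeholders.

```python
import math
from bisect import bisect_right
from collections import Counter


def discretize(x, bounds, values):
    """Return the value of the interval containing x; needs len(values) == len(bounds) + 1."""
    return values[bisect_right(bounds, x)]


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pos_ratio(tagged_words, target_tags) -> float:
    """Share of words whose tag is one of the preset parts of speech."""
    if not tagged_words:
        return 0.0
    hits = sum(1 for _, tag in tagged_words if tag in target_tags)
    return hits / len(tagged_words)


document = "restart the service then check the log"
template = "restart the service"

# Length-related feature: 7 words fall in the hypothetical middle interval.
length_feature = discretize(len(document.split()), [5, 50], [0.2, 0.6, 1.0])
# Template-related feature: cosine similarity of bag-of-words vectors.
template_feature = cosine_similarity(Counter(document.split()), Counter(template.split()))
# Part-of-speech-related feature: hand-tagged words, verbs and nouns as preset tags.
pos_feature = pos_ratio(
    [("restart", "VERB"), ("the", "DET"), ("service", "NOUN"), ("now", "ADV")],
    {"VERB", "NOUN"},
)
```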
In some embodiments of the present application, the extracting of the target feature from the document to be processed includes:
discretizing the number of words of the document to be processed according to a plurality of predetermined word count intervals to obtain a first feature, where each word count interval corresponds to a value; discretizing the average sentence length of the document to be processed according to a plurality of predetermined sentence length intervals to obtain a second feature, where each sentence length interval corresponds to a value; using the number of document errors of the document to be processed as the independent variable of an exponential function, and using the value of the exponential function as a third feature; discretizing the number of advanced words of the document to be processed according to a plurality of predetermined intervals of the number of advanced words to obtain a fourth feature, where each interval corresponds to a value and an advanced word is a word found in a predetermined advanced vocabulary list;
using at least two of the first feature, the second feature, the third feature and the fourth feature as the target feature.
It can be seen that the embodiments of the present application can evaluate the quality of the document to be processed based on the first feature, the second feature, the third feature and the fourth feature. Because these are four different features, the quality of the document can be assessed accurately from several aspects.
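A sketch of the third and fourth features follows. It assumes the exponential function has the form R raised to the number of errors, which is consistent with the embodiment's example (two errors, R = 0.9, C3 = 0.81) but is not spelled out in this section. For the fourth feature only the rule "20 or more advanced words gives 1" comes from the source; the lower interval boundaries and values are placeholders.

```python
from bisect import bisect_right


def discretize(x, bounds, values):
    """Return the value of the interval containing x; needs len(values) == len(bounds) + 1."""
    return values[bisect_right(bounds, x)]


def error_feature(error_count: int, base: float = 0.9) -> float:
    """Third feature, assumed to be base ** error_count (decays as errors grow)."""
    return base ** error_count


# Intervals for the number of advanced words: [0,5), [5,10), [10,20), [20, inf).
# Only the last interval (>= 20 -> 1.0) is taken from the source; the rest are hypothetical.
ADVANCED_BOUNDS = [5, 10, 20]
ADVANCED_VALUES = [0.25, 0.5, 0.75, 1.0]

c3 = error_feature(2)                                    # two document errors, R = 0.9
c4 = discretize(20, ADVANCED_BOUNDS, ADVANCED_VALUES)    # 20 advanced words
```

The same `discretize` helper would serve for the first and second features once their word count and sentence length intervals are known.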
An embodiment of the present application provides a document processing apparatus, the apparatus including:
a first obtaining module, configured to obtain a document to be processed;
a receiving module, configured to receive a configuration file sent by a third-party platform, where the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform, and the file package includes first information representing a feature extraction method of the target feature;
a second obtaining module, configured to obtain the file package based on the path information of the file package when the identifier of the target feature is different from the identifier of a default feature;
a processing module, configured to extract the target feature from the document to be processed based on the first information in the file package.
In some embodiments of the present application, the file package includes a custom class, and the first information is located in the custom class;
the second obtaining module is further configured to load the custom class in the file package through a reflection mechanism of a programming language, and to obtain the first information from the loaded custom class.
In some embodiments of the present application, the configuration file further includes second information, where the second information includes an identifier of the file package and/or an identifier of the custom class;
the second obtaining module being configured to load the custom class in the file package through the reflection mechanism of the programming language includes:
when it is determined that the second information in the configuration file is information agreed in advance with the third-party platform, loading the custom class in the file package through the reflection mechanism of the programming language.
在本申请的一些实施例中,所述第二获取模块,还配置为获取预先设置的所述第二信息的加密方式;基于所述第二信息的加密方式对应的解密方式,对所述配置文件中的加密信息进行解密,得到所述第二信息;其中,所述加密信息是基于所述加密方式对所述第二信息进行加密得到的。In some embodiments of the present application, the second obtaining module is further configured to obtain a preset encryption method of the second information; based on the decryption method corresponding to the encryption method of the second information, the configuration The encrypted information in the file is decrypted to obtain the second information; wherein the encrypted information is obtained by encrypting the second information based on the encryption method.
In some embodiments of the present application, the second obtaining module is further configured to predetermine an abstract class and to require the custom class to inherit from the predetermined abstract class.
The second obtaining module being configured to obtain the first information from the loaded custom class includes:
instantiating the custom class as an object and, when the object belongs to the abstract class, obtaining the first information from the loaded custom class.
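The instantiate-then-check step can be illustrated with a minimal Java sketch. The names `AbstractFeatureExtractor`, `WordCountExtractor`, `ExtractorInstantiator` and the `extract` method are hypothetical and are not taken from the application; the sketch only assumes that the agreed abstract class exposes a single extraction method.

```java
// Hypothetical abstract class agreed on in advance; the name and method are illustrative.
abstract class AbstractFeatureExtractor {
    abstract double extract(String document);
}

// Hypothetical custom class standing in for one shipped in a third-party file package.
class WordCountExtractor extends AbstractFeatureExtractor {
    @Override
    double extract(String document) {
        String trimmed = document.trim();
        return trimmed.isEmpty() ? 0.0 : trimmed.split("\\s+").length;
    }
}

final class ExtractorInstantiator {
    // Instantiates the named class and returns the object only if it belongs to
    // the agreed abstract class; returns null in every other case.
    static AbstractFeatureExtractor instantiate(String className) {
        try {
            Object obj = Class.forName(className).getDeclaredConstructor().newInstance();
            return (obj instanceof AbstractFeatureExtractor) ? (AbstractFeatureExtractor) obj : null;
        } catch (ReflectiveOperationException e) {
            return null; // class missing, no no-arg constructor, or construction failed
        }
    }
}
```

In this sketch a class that does not inherit from the agreed abstract class is simply rejected, so the extraction code of an unrelated class is never invoked.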
In some embodiments of the present application, the processing module is further configured to, when the identifier of the target feature is the same as the identifier of a default feature, extract the target feature from the document to be processed based on the predetermined extraction mode of the default feature.
In some embodiments of the present application, the processing module is further configured to score the quality of the document to be processed based on the target feature, so as to obtain a quality score value of the document to be processed.
In some embodiments of the present application, the target feature includes at least two features, and the configuration file includes weight information for each of the at least two features.
The processing module being configured to score the quality of the document to be processed based on the target feature, so as to obtain the quality score value of the document to be processed, includes:
performing a weighted summation over the at least two features based on the weight information of each feature, so as to obtain the quality score value of the document to be processed.
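The weighted summation can be sketched as follows in Java. The class name `QualityScorer`, the map-based representation and the feature names used in the usage note are assumptions made for illustration; the application does not prescribe a data structure.

```java
import java.util.Map;

final class QualityScorer {
    // Returns the sum of weight * value over all features.
    // Both maps are keyed by the feature identifier (hypothetical representation).
    static double score(Map<String, Double> featureValues, Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : featureValues.entrySet()) {
            Double weight = weights.get(entry.getKey());
            if (weight == null) {
                throw new IllegalArgumentException("No weight configured for feature: " + entry.getKey());
            }
            total += weight * entry.getValue();
        }
        return total;
    }
}
```

For example, feature values 0.8 and 0.5 with weights 0.4 and 0.6 give 0.8 × 0.4 + 0.5 × 0.6 = 0.62.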
In some embodiments of the present application, the processing module being configured to extract the target feature from the document to be processed includes:
discretizing the character count of the document to be processed according to a plurality of predetermined character-count intervals to obtain a length-related feature, each character-count interval corresponding to one value; extracting a document feature vector of the document to be processed, and taking the cosine similarity between that vector and the document feature vector of a preset template as a template-related feature; determining a part-of-speech-related feature according to the proportion of words of a preset part of speech among all words in the document to be processed; and
taking at least two of the length-related feature, the template-related feature and the part-of-speech-related feature as the target feature.
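The three features above can be sketched in Java as follows. The class name `DocumentFeatures`, the interval boundaries and the per-interval values are hypothetical; how the document feature vector is computed is left open here, as it is in the text.

```java
final class DocumentFeatures {
    // Maps a count to the value of the interval it falls into.
    // upperBounds[i] is the exclusive upper bound of interval i, and values
    // holds one extra entry for the last, open-ended interval.
    static double discretize(int count, int[] upperBounds, double[] values) {
        if (values.length != upperBounds.length + 1) {
            throw new IllegalArgumentException("values must have one more entry than upperBounds");
        }
        for (int i = 0; i < upperBounds.length; i++) {
            if (count < upperBounds[i]) {
                return values[i];
            }
        }
        return values[upperBounds.length];
    }

    // Cosine similarity between two document feature vectors of equal length.
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for zero vectors
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Proportion of words with the preset part of speech among all words.
    static double partOfSpeechRatio(int wordsWithPresetPos, int totalWords) {
        return totalWords == 0 ? 0.0 : (double) wordsWithPresetPos / totalWords;
    }
}
```

With intervals [0, 100), [100, 500) and [500, ∞) mapped to 0.2, 0.6 and 1.0, a document of 250 characters gets the length-related feature 0.6.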
In some embodiments of the present application, the processing module being configured to extract the target feature from the document to be processed includes:
discretizing the word count of the document to be processed according to a plurality of predetermined word-count intervals to obtain a first feature, each word-count interval corresponding to one value; discretizing the average sentence length of the document to be processed according to a plurality of predetermined sentence-length intervals to obtain a second feature, each sentence-length interval corresponding to one value; taking the number of errors in the document to be processed as the independent variable of an exponential function, and taking the resulting value of the exponential function as a third feature; discretizing the number of advanced words in the document to be processed according to a plurality of predetermined advanced-word-count intervals to obtain a fourth feature, each advanced-word-count interval corresponding to one value, where an advanced word is a word contained in a predetermined advanced vocabulary list; and
taking at least two of the first feature, the second feature, the third feature and the fourth feature as the target feature.
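The text does not state the form of the exponential function used for the third feature. One plausible choice, shown here purely as an assumption, is a decaying exponential exp(-k · n) of the error count n, so that an error-free document scores 1 and the value falls toward 0 as errors accumulate. The class name `ErrorFeature` and the decay constant `k` are hypothetical.

```java
final class ErrorFeature {
    // Assumed form: exp(-decay * errorCount). The application only says that the
    // error count is the independent variable of an exponential function.
    static double fromErrorCount(int errorCount, double decay) {
        if (errorCount < 0 || decay <= 0.0) {
            throw new IllegalArgumentException("errorCount must be >= 0 and decay must be > 0");
        }
        return Math.exp(-decay * errorCount);
    }
}
```

Under this assumption the feature stays within (0, 1], which keeps it on a scale comparable to the discretized features before weighting.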
An embodiment of the present application provides an electronic device, including:
a memory configured to store executable instructions; and
a processor configured to implement any one of the above document processing methods when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement any one of the above document processing methods.
An embodiment of the present application provides a computer program including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes any one of the above document processing methods.
In the embodiments of the present application, a document to be processed is obtained; a configuration file sent by a third-party platform is received, the configuration file including an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform, and the file package including first information representing a feature extraction method for the target feature; when the identifier of the target feature differs from the identifier of a default feature, the file package is obtained based on its path information; and the target feature is extracted from the document to be processed based on the first information in the file package. It can be seen that, when a target feature that is not a default feature needs to be extracted from the document to be processed, no new program code needs to be written and run locally; instead, the extraction method for the target feature can be obtained directly from the third-party platform, which reduces time and labor costs to a certain extent.
Description of Drawings
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application;
FIG. 2 is an optional flowchart of the document processing method provided by an embodiment of the present application;
FIG. 3 is a flowchart of encrypted transmission of the information in the configuration file in an embodiment of the present application;
FIG. 4 is another optional flowchart of the document processing method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of an optional composition structure of the document processing apparatus of an embodiment of the present application;
FIG. 6 is a schematic diagram of an optional composition structure of the electronic device provided by an embodiment of the present application.
Detailed Description
In the related art, contingency plan documents can only be managed with a scheme similar to a library document management system, which supports uploading and downloading documents but cannot evaluate document quality. Moreover, under this management mode documents can be uploaded to the document library at will, which may leave the library with documents of uneven quality; as individuals, enterprises and society develop, the number of documents in the library keeps growing.
In the related art, document management may rely on manual evaluation of document quality. However, this adds substantial labor cost, each person's evaluation criteria cannot be preserved as reusable experience, and manual evaluation is highly subjective and insufficiently objective. In the related art, the features of a specific type of document, for example an English composition or a Chinese composition, may also be extracted through feature engineering, and document quality is then evaluated based on the extracted features. Different types of documents may require different types of features, so different feature extraction models must be developed and deployed, or different feature libraries must be developed. Deploying different feature extraction models requires writing and deploying new program code locally, which increases time and labor costs.
The technical solutions of the embodiments of the present application are proposed to address the above technical problems.
To make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting the present application, and all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The embodiments of the present application provide a document processing method, apparatus, device and computer-readable storage medium. The document processing method provided by the embodiments of the present application can be applied in an electronic device. Exemplary applications of the electronic device are described below: the electronic device may be implemented as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device or a portable game device), and the like.
FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application. As shown in FIG. 1, the electronic device 100 may be connected to the third-party platform 102 through the network 101. The network 101 may be a wide area network, a local area network, or a combination of the two. The third-party platform 102 may be implemented on a terminal and/or a server. The terminal may be, but is not limited to, a tablet computer, a notebook computer or a desktop computer. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
In some embodiments of the present application, the third-party platform 102 may obtain the document to be processed and send it to the electronic device 100. The document to be processed may be of any type. In some embodiments, it may be a Chinese document, an English document or a document in another language; in some embodiments, it may be a contingency plan document, log data of an electronic device, or another document. It should be noted that the above merely gives examples of the type of the document to be processed, and the embodiments of the present application are not limited thereto.
In some embodiments of the present application, the electronic device 100 may obtain the document to be processed locally or download it from the network 101, and may send the document to be processed to the third-party platform 102.
After obtaining the document to be processed, the third-party platform 102 may determine the target feature of the document to be processed and the feature extraction method for the target feature, and generate a configuration file. The configuration file includes at least the identifier of the target feature of the document to be processed and path information of a file package provided by the third-party platform 102; the file package includes first information representing the feature extraction method for the target feature. Here, the first information may be program code that implements the feature extraction method for the target feature.
In the embodiments of the present application, the third-party platform 102 determines the target feature according to the actual feature extraction requirement. The target feature may be a single feature or may include multiple features. The identifier of the target feature may be a name, a number or another identifier.
In the embodiments of the present application, the file package may include a collection of code that provides at least one function in an object-oriented programming language. Exemplarily, the object-oriented programming language may be the JAVA language, the C++ language, or the like; when the object-oriented programming language is the JAVA language, the file package may be a jar package.
The third-party platform 102 may send the configuration file and the file package to the electronic device 100.
In the embodiments of the present application, the path information of the file package may indicate the storage location of the file package in the electronic device 100. The electronic device 100 may determine the storage location of the file package according to the configuration file, extract the first information from the file package, and extract the target feature from the document to be processed according to the first information.
The document processing method of the embodiments of the present application is described below by way of example with reference to the application scenario shown in FIG. 1.
FIG. 2 is an optional flowchart of the document processing method provided by an embodiment of the present application. As shown in FIG. 2, the flow may include:
Step 201: Obtain the document to be processed.
Step 202: Receive the configuration file sent by the third-party platform.
Here, the implementation of steps 201 and 202 has been described above and is not repeated.
Step 203: When the identifier of the target feature differs from the identifier of the default feature, obtain the file package based on the path information of the file package.
In the embodiments of the present application, a default feature is a feature predetermined by the electronic device, and the extraction mode of a default feature is also predetermined.
When the identifier of the target feature differs from the identifier of the default feature, the target feature is not a default feature, and a feature extraction mode needs to be determined specifically for the target feature. In this case, the file package can be read based on the path information of the file package in the configuration file.
Step 204: Extract the target feature from the document to be processed based on the first information in the file package.
In the embodiments of the present application, the first information represents the feature extraction method for the target feature. The feature extraction method can therefore be determined from the first information, and the target feature can then be extracted from the document to be processed.
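The branch taken in step 203 can be sketched in Java as follows. The class name `FeatureDispatch`, the method name and the use of a set of default feature identifiers are hypothetical; the sketch only decides whether a file package has to be read and, if so, from which path.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.Set;

final class FeatureDispatch {
    // Returns the path of the file package to read when the target feature is
    // not a default feature, or an empty Optional when the predetermined
    // extraction mode of the default feature should be used instead.
    static Optional<Path> packageToLoad(String targetFeatureId,
                                        Set<String> defaultFeatureIds,
                                        String packagePathFromConfig) {
        if (defaultFeatureIds.contains(targetFeatureId)) {
            return Optional.empty();
        }
        return Optional.of(Paths.get(packagePathFromConfig));
    }
}
```

A caller would run the built-in extractor when the result is empty and otherwise continue with step 204 using the returned path.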
In some embodiments of the present application, the feature extraction method for the target feature is implemented based on a natural language processing (Natural Language Processing, NLP) method or another document processing method. In some embodiments, the feature extraction method for the target feature may include a first method and a second method, where the first method may be denoted as the doCalculator method and the second method may be denoted as the featureCalculate method.
In the embodiments of the present application, processing the document to be processed based on the first method may include: 1) segmenting the document to be processed into words using an NLP method, and then computing word-level statistics; 2) splitting the document to be processed into sentences using an NLP method, and then computing sentence-level statistics; 3) removing high-frequency words and modal particles and performing denoising; 4) extracting data such as the main title, subtitles and font sizes from the document to be processed, for example by using Apache POI, the Java API for Microsoft Documents.
In some embodiments, different language packages may be used depending on the language of the document to be processed. For example, when the document to be processed is a Chinese document, the Han Language Processing (HanLP) package may be used to segment it into words or sentences; when the document to be processed is an English document, an English language processing package may be used to segment it into words or sentences.
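Word-level and sentence-level statistics of the kind listed above can be illustrated with the JDK's `java.text.BreakIterator`. This is a stand-in chosen so that the sketch runs without external libraries, not the HanLP-based processing named in the text; its word boundaries for Chinese are not dictionary-based in the default JDK implementation, so a dedicated segmenter would be expected in practice. The class name `TextStats` is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class TextStats {
    // Counts sentences that contain at least one non-whitespace character.
    static int countSentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (!text.substring(start, end).trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Counts segments that start with a letter or digit, skipping spaces and punctuation.
    static int countWords(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                count++;
            }
        }
        return count;
    }
}
```

Dividing the word count by the sentence count gives an average sentence length that could feed a sentence-length feature.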
In the embodiments of the present application, processing the document to be processed based on the first method yields a preliminary processing result, which includes the values of the features. The preliminary processing result may then be processed further based on the second method; for example, discrete feature values may be normalized and continuous feature values may be averaged.
It should be noted that the above merely gives examples of how the first method and the second method may be implemented, and the embodiments of the present application are not limited thereto.
In some embodiments of the present application, when the first information is program code that implements the feature extraction method for the target feature, the program code may be executed to obtain the target feature.
In practical applications, steps 201 to 204 may be implemented by a processor of the electronic device. The processor may be at least one of an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic Device, PLD), a field programmable gate array (Field Programmable Gate Array, FPGA), a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller and a microprocessor. It can be understood that other electronic components may also implement the functions of the processor, which is not limited in the embodiments of the present application.
It can be seen that, in the embodiments of the present application, when a target feature that is not a default feature needs to be extracted from the document to be processed, no new program code needs to be written and run locally; instead, the extraction method for the target feature can be obtained directly from the third-party platform, which reduces time and labor costs to a certain extent.
Further, if a target feature needs to be modified, added or deleted, the third-party platform can modify, add or delete the identifier of the target feature in the configuration file and modify the content of the file package. In this way, the electronic device does not need to write and run new program code locally, and can instead extract the target feature directly based on the received configuration file and file package.
In some embodiments of the present application, the file package includes a custom class, and the first information is located in the custom class.
Here, a class in the file package denotes, in an object-oriented programming language, the collective name for a set of objects with the same attributes and behaviors. An object is an abstraction of a real-world thing, and a class is an abstraction of objects, that is, an abstract data type. After defining the custom class, the third-party platform may place the first information in the custom class.
In the embodiments of the present application, the custom class in the file package may also be loaded through the reflection mechanism of the programming language, and the first information may be obtained from the loaded custom class.
Here, the reflection mechanism of a programming language refers to the ability of a program to access, inspect and modify its own state or behavior. In one example, the reflection mechanism of the JAVA language means that, while the program is running, an object of any class can be constructed, the class to which any object belongs can be determined, the member variables and methods of any class can be discovered, and the attributes and methods of any object can be invoked. This capability of dynamically obtaining program information and dynamically invoking objects is called the reflection mechanism of the JAVA language.
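The capabilities described above can be demonstrated with the standard `java.lang.reflect` API. In this minimal sketch the class to construct and the method to invoke are known only as strings at run time; the helper class `ReflectionDemo` is hypothetical.

```java
import java.lang.reflect.Method;

final class ReflectionDemo {
    // Constructs an object of the named class through its no-argument constructor.
    static Object construct(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot construct " + className, e);
        }
    }

    // Looks up a public no-argument method by name on the object's class and invokes it.
    static Object call(Object target, String methodName) {
        try {
            Method method = target.getClass().getMethod(methodName);
            return method.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + methodName, e);
        }
    }
}
```

For instance, `ReflectionDemo.call(ReflectionDemo.construct("java.util.ArrayList"), "size")` constructs a list and invokes `size()` without the calling code ever naming the `ArrayList` type at compile time.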
In current JAVA-related technologies, a third-party method is usually used by loading the classes of the file package with an import statement. However, the classes of the file package must be introduced in advance before import can be used, so they must be known beforehand. When the classes of the file package are unknown, they cannot be loaded through import, and the classes in a file package received in real time cannot be loaded dynamically.
In the embodiments of the present application, by contrast, the custom class in the file package can be loaded through the reflection mechanism of the programming language. That is, regardless of whether the custom class in the file package is known or unknown, it can be loaded through reflection without being introduced in advance; when the file package is received in real time, the custom class in the file package can be loaded dynamically.
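When the file package is a jar package, one way to load a class whose name is known only at run time is the JDK's `java.net.URLClassLoader`. The following is a minimal sketch under that assumption; the class name `PackageClassLoader` and the choice to return `null` on failure are illustrative, and the application does not state which class loader is used.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

final class PackageClassLoader {
    // Loads the named class, searching the given jar after the parent class loader.
    // Returns null when the class cannot be found.
    static Class<?> load(Path packagePath, String className) {
        try {
            URL[] urls = { packagePath.toUri().toURL() };
            // The loader is intentionally left open: closing it would prevent the
            // loaded class from lazily resolving further classes from the same jar.
            URLClassLoader loader = new URLClassLoader(urls, PackageClassLoader.class.getClassLoader());
            return Class.forName(className, true, loader);
        } catch (MalformedURLException | ClassNotFoundException e) {
            return null;
        }
    }
}
```

Because the jar path and the class name are both plain strings, they can come straight from a configuration file received at run time, with no import statement for the loaded class.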
In some embodiments of the present application, the electronic device may agree with the third-party platform in advance on the identifier of the file package and/or the identifier of the custom class in the file package. Exemplarily, the identifier of the file package may be the name of the file package or another identifier, and the identifier of the custom class may be the name of the custom class, the number of the custom class, or another identifier.
It can be understood that, even though the electronic device and the third-party platform have agreed on the identifier of the file package and/or the identifier of the custom class, if the third-party platform is maliciously attacked or does not generate these identifiers as agreed, the identifier of the file package sent by the third-party platform will differ from the agreed identifier of the file package, and/or the identifier of the custom class in the file package sent by the third-party platform will differ from the agreed identifier of the custom class. As a result, the file package provided by the third-party platform will not meet the actual requirements.
In some embodiments of the present application, the configuration file may further include second information, and the second information includes the identifier of the file package provided by the third-party platform and/or the identifier of the custom class.
Correspondingly, one way of loading the custom class in the file package through the reflection mechanism of the programming language is to load it only when it is determined that the second information in the configuration file is the information agreed upon in advance with the third-party platform.
It should be noted that, when it is determined that the second information in the configuration file is not the information agreed upon in advance with the third-party platform, the received file package may be ignored.
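The check on the second information can be sketched as a simple comparison against the agreed identifiers. The class name `SecondInfoVerifier` is hypothetical, and the "and/or" in the text is modelled here by allowing either agreed identifier to be `null`, meaning that it was not part of the agreement.

```java
final class SecondInfoVerifier {
    private final String agreedPackageId; // null if no file package identifier was agreed
    private final String agreedClassId;   // null if no custom class identifier was agreed

    SecondInfoVerifier(String agreedPackageId, String agreedClassId) {
        this.agreedPackageId = agreedPackageId;
        this.agreedClassId = agreedClassId;
    }

    // Returns true only if every agreed identifier matches the received one;
    // the caller loads the custom class on true and ignores the file package on false.
    boolean accepts(String receivedPackageId, String receivedClassId) {
        boolean packageOk = agreedPackageId == null || agreedPackageId.equals(receivedPackageId);
        boolean classOk = agreedClassId == null || agreedClassId.equals(receivedClassId);
        return packageOk && classOk;
    }
}
```

Keeping the check separate from class loading means that a file package with unexpected identifiers is never handed to the class loader at all.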
It can be seen that, when the second information in the configuration file is the information agreed upon in advance with the third-party platform, the second information is correct. Loading the custom class in the file package on this basis helps to obtain the first information from the custom class accurately and, in turn, to extract the target feature accurately.
In current Java-related technology, file packages provided by third-party platforms are not authenticated. Consequently, if a malicious attacker such as a hacker learns information such as the name of the custom class in the file package, the attacker can attack the electronic device by forging a file package.
To address this technical problem, in some embodiments of the present application, the electronic device may also obtain a preset encryption mode for the second information. Correspondingly, after receiving the configuration file sent by the third-party platform, the electronic device decrypts the encrypted information in the configuration file using the decryption mode corresponding to that encryption mode, and obtains the second information. The encrypted information is obtained by encrypting the second information with the encryption mode.
In some embodiments, the electronic device may obtain the preset encryption mode for the second information before receiving the configuration file sent by the third-party platform. For example, the preset encryption mode may be one that the electronic device and the third-party platform have agreed on.
Here, after the third-party platform and the electronic device agree on the encryption mode for the second information, the third-party platform, once it has generated the second information, may encrypt it with the agreed encryption mode to obtain the encrypted information. It may then send the configuration file containing the encrypted information to the electronic device.
In some embodiments of the present application, the encryption mode and the decryption mode may be set according to the actual situation. For example, they may be determined based on a symmetric encryption method such as the Data Encryption Standard (DES), or based on an asymmetric encryption method.
FIG. 3 is a flowchart of encrypted transmission of the information in the configuration file in an embodiment of the present application. Referring to FIG. 3, when the encryption mode and the decryption mode are determined based on DES, the process of encrypted transmission of the information in the configuration file may include:
Step 301: The electronic device sends the public key and the private key to the third-party platform.
In this embodiment of the present application, the electronic device may agree on the second information with the third-party platform; the electronic device may store the public key, the private key, and the agreed second information in a database to facilitate subsequent verification.
Step 302: The third-party platform encrypts the second information with the private key.
In this embodiment of the present application, after receiving the private key, the third-party platform does not need to encrypt the file package or the classes in the file package directly. Instead, after writing the second information into the configuration file, it encrypts the second information with the private key.
Step 303: The third-party platform writes the public key corresponding to the private key into the configuration file, and sends the configuration file to the electronic device.
In this embodiment of the present application, after encrypting the second information in the configuration file with the private key and writing the public key corresponding to the private key into the configuration file, the third-party platform may send the configuration file to the electronic device.
In other embodiments, the configuration file further includes an identifier of the feature extraction method for the target feature. Correspondingly, the third-party platform may also encrypt this identifier with the private key; the identifier of the feature extraction method may be information such as a name.
Step 304: The electronic device looks up the private key corresponding to the public key.
In this embodiment of the present application, after receiving the configuration file, the electronic device may read information such as the path information and the public key from the configuration file, and look up the private key corresponding to that public key in the database.
Step 305: The electronic device decrypts the encrypted information in the configuration file with the private key.
In this embodiment of the present application, both step 304 and step 305 may be implemented by a program running on the electronic device.
If the identifier of the file package and/or the identifier of the custom class in the decrypted information matches the agreed second information, the file package is a correct data package.
It can be seen that, in the embodiments of the present application, agreeing on the encryption mode for the second information in the configuration file causes the third-party platform to encrypt the second information. After receiving the configuration file sent by the third-party platform, the electronic device can decrypt it using the decryption mode corresponding to the agreed encryption mode. The second information is thus transmitted in encrypted form, which helps to improve its security and to reduce the risk of it being attacked.
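Steps 301 to 305 can be sketched in Java with the standard javax.crypto API. The sketch rests on assumptions that are not stated in the application: it reads the flow symmetrically, treating the value called the public key only as a lookup handle written into the configuration file and the value called the private key as the 8-byte DES secret. The class SecondInfoCrypto and its methods are hypothetical, an in-memory map stands in for the database, and ECB mode is used only to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of the encrypted transmission of the second information.
final class SecondInfoCrypto {
    // Stand-in for the database of step 301: lookup handle -> 8-byte DES secret.
    private final Map<String, byte[]> keyStore = new HashMap<>();

    void register(String keyId, byte[] desKey) {
        keyStore.put(keyId, desKey.clone());
    }

    // Step 302 (third-party platform side): encrypt the second information.
    static String encrypt(String plain, byte[] desKey) {
        byte[] out = run(Cipher.ENCRYPT_MODE, desKey, plain.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(out);
    }

    // Steps 304 and 305 (electronic device side): look up the secret, then decrypt.
    String decrypt(String keyId, String cipherText) {
        byte[] desKey = keyStore.get(keyId);
        if (desKey == null) {
            throw new IllegalArgumentException("no key registered for " + keyId);
        }
        byte[] out = run(Cipher.DECRYPT_MODE, desKey, Base64.getDecoder().decode(cipherText));
        return new String(out, StandardCharsets.UTF_8);
    }

    private static byte[] run(int mode, byte[] desKey, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("DES/ECB/PKCS5Padding");
            cipher.init(mode, new SecretKeySpec(desKey, "DES"));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("DES operation failed", e);
        }
    }
}
```

After decryption, the recovered identifiers would be compared with the agreed second information before the file package is accepted.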
In some embodiments of the present application, the electronic device may predetermine an abstract class and require the custom class to inherit the predetermined abstract class. For example, the electronic device may agree with the third-party platform that the custom class inherits the predetermined abstract class.
Here, an abstract class is a class that cannot be instantiated as an object. Inheritance is a concept in object-oriented software technology that gives a subclass the attributes and methods of its parent class, or lets a subclass inherit methods from its parent class so that the subclass behaves in the same way as the parent class.
In practical applications, the electronic device may agree, through interaction with the third-party platform, that the custom class in the file package inherits the abstract class. It can be understood that, although the electronic device and the third-party platform agree that the custom class inherits the predetermined abstract class, the class in the file package provided by the third-party platform does not actually inherit the abstract class if the third-party platform suffers a malicious attack or does not inherit the abstract class as agreed.
In current Java-related technology, if the custom class in the file package provided by the third-party platform does not inherit the predetermined abstract class, the electronic device may be unable to obtain the first information from the custom class.
To address this technical problem, in the embodiments of the present application, one implementation of obtaining the first information from the custom class is to instantiate the custom class as an object and, when the object belongs to the abstract class, obtain the first information from the loaded custom class.
It should be noted that, when the object is determined not to belong to the abstract class, the received file package may be ignored.
In some embodiments of the present application, after determining that the received file package is a correct data package, the electronic device needs to determine whether the class in the file package inherits the predetermined abstract class. In one implementation, support for the Java reflection function may be enabled in the custom class loader URLClassLoader by setting setAccessible. The custom class loader URLClassLoader can then be used to load the custom class in the file package and instantiate the loaded custom class as an object. Next, the instanceof operator can be used to determine whether the object instantiated from the custom class belongs to the abstract class. If it does, the class in the file package inherits the abstract class, and the first information can be obtained from the custom class. If it does not, the class in the file package does not inherit the abstract class, and the file package can be ignored.
It can be seen that, in the embodiments of the present application, when the object instantiated from the custom class belongs to the abstract class, the custom class can be regarded as a correct class. This helps to obtain the first information from the custom class accurately and, in turn, to extract the target feature accurately.
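The load, instantiate, and check sequence can be sketched in Java as follows. All class names here (AbstractFeatureExtractor, WordCountExtractor, RogueExtractor, ReflectiveLoader) are hypothetical stand-ins, and the sketch takes any ClassLoader so that it can run without a jar file; it is an illustration under these assumptions, not the implementation of the application.

```java
import java.lang.reflect.Constructor;
import java.util.Optional;

// Hypothetical abstract class agreed on with the third-party platform.
abstract class AbstractFeatureExtractor {
    abstract double extract(String document);
}

// Stand-in for a custom class that honours the agreement.
class WordCountExtractor extends AbstractFeatureExtractor {
    @Override
    double extract(String document) {
        String trimmed = document.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

// Stand-in for a custom class that does not inherit the agreed abstract class.
class RogueExtractor {
}

final class ReflectiveLoader {
    // Loads the named class by reflection, instantiates it, and returns the
    // instance only if it belongs to the agreed abstract class. An empty
    // result means the file package should be ignored.
    static Optional<AbstractFeatureExtractor> load(ClassLoader loader, String className) {
        try {
            Class<?> cls = Class.forName(className, true, loader);
            Constructor<?> ctor = cls.getDeclaredConstructor();
            ctor.setAccessible(true);
            Object instance = ctor.newInstance();
            if (instance instanceof AbstractFeatureExtractor) {
                return Optional.of((AbstractFeatureExtractor) instance);
            }
            return Optional.empty();
        } catch (ReflectiveOperationException e) {
            return Optional.empty();
        }
    }
}
```

For a real file package, the loader argument could be a java.net.URLClassLoader constructed over the URL of the jar named in the configuration file.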
In some embodiments of the present application, when the identifier of the target feature is the same as the identifier of a default feature, the target feature is a default feature. In this case, the target feature can be extracted from the document to be processed using the predetermined extraction method of the default feature.
It can be seen that, when the target feature is a default feature, the embodiments of the present application do not need to obtain an extraction method for the target feature from the third-party platform. The target feature can instead be extracted with the predetermined extraction method of the default feature, which is easy to implement.
In some embodiments, when the target features of the document to be processed include multiple features, all of them may be default features, none of them may be default features, or some may be default features while the others are not. It can be seen that the embodiments of the present application provide a corresponding feature extraction method whether or not a target feature is a default feature.
When documents are processed with the method of the embodiments of the present application, program code needs to be deployed on the electronic device only for the extraction methods of the default features. When a target feature is not a default feature, it can be extracted using only the configuration file and the file package sent by the third-party platform together with the reflection mechanism of the Java language.
If the identifiers of the target features in the configuration file are all identifiers of default features, only the default features are needed and no new feature has to be extracted from the document to be processed. If a non-default feature is to be extracted from the document to be processed, the third-party platform may write the identifier of the non-default feature into the configuration file and send the configuration file and the corresponding file package to the electronic device; the electronic device can then extract the new non-default feature according to the configuration file and the file package. In other words, the third-party platform can determine the content of the configuration file and of the file package according to the extraction requirements for the target features of the document to be processed. When the target features to be extracted change, only the identifiers of the target features in the configuration file and the content of the file package need to be changed.
In some embodiments, most of the target features that need to be extracted for document quality assessment may be default features, while new non-default features may be needed for different types of documents. In this case, the third-party platform can send different jar packages to the electronic device and set different configuration file contents for different types of documents. The electronic device can then extract the non-default features directly from the different jar packages, using the feature extraction methods provided by the third-party platform. Compared with solutions in the related art that require new program code to be written and run locally on the electronic device, this saves labor and time.
In some embodiments of the present application, the document processing method may be implemented by a main thread running on the electronic device, as described below by way of example with reference to FIG. 4. FIG. 4 is another optional flowchart of the document processing method according to an embodiment of the present application. As shown in FIG. 4, the main thread of the electronic device may be denoted as the thread epicDocCalculate, and the document processing method implemented by the main thread may include:
Step 401: Read the configuration file and the file package.
In this embodiment of the present application, the main thread of the electronic device may read the configuration file and the file package sent by the third-party platform.
Step 402: Determine whether the identifier of the target feature is the same as the identifier of a default feature. If so, perform step 403; if not, perform step 404.
In this embodiment of the present application, the main thread of the electronic device may determine, based on the configuration file, whether the identifier of each target feature of the document to be processed is the same as the identifier of a default feature.
Step 403: Extract the default feature.
In this embodiment of the present application, the default feature may be extracted using the predetermined extraction method of the default feature.
Step 404: Determine whether the file package and the class in the file package are correct. If both are correct, perform step 405; if the file package or the class in the file package is incorrect, return to step 401.
In this embodiment of the present application, whether the file package and the class in the file package are correct may be determined as described above, which is not repeated here.
Step 405: Extract the target feature from the document to be processed based on the first information in the file package.
It can be seen that, whether or not the target feature is a default feature, it can be extracted through steps 401 to 405.
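The branching in steps 402 to 405 can be sketched in Java as follows. The class FeaturePipeline, the method extractAll, and the flag pluginTrusted (standing for the outcome of step 404) are hypothetical. As a simplification, the sketch skips a non-default feature when the file package is not trusted, instead of returning to step 401.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of the dispatch between default and non-default features.
final class FeaturePipeline {
    static Map<String, Double> extractAll(List<String> targetIds,
                                          Map<String, ToDoubleFunction<String>> defaultExtractors,
                                          Map<String, ToDoubleFunction<String>> pluginExtractors,
                                          boolean pluginTrusted,
                                          String document) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (String id : targetIds) {
            // Steps 402 and 403: a matching identifier selects the built-in extractor.
            ToDoubleFunction<String> extractor = defaultExtractors.get(id);
            // Steps 404 and 405: otherwise use the extractor from the verified file package.
            if (extractor == null && pluginTrusted) {
                extractor = pluginExtractors.get(id);
            }
            if (extractor != null) {
                features.put(id, extractor.applyAsDouble(document));
            }
        }
        return features;
    }
}
```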
Of course, in other embodiments of the present application, the electronic device, after acquiring the document to be processed, may also skip receiving a configuration file from the third-party platform and extract the default features directly from the document to be processed using the predetermined extraction method of the default features.
In some embodiments of the present application, after the target features are extracted, the document to be processed may also be scored for quality based on the target features to obtain its quality score value, thereby assessing the quality of the document to be processed.
In some embodiments, the target features include at least two features, and the configuration file includes weight information for each of the at least two features.
Correspondingly, an implementation of scoring the document to be processed for quality based on the target features to obtain its quality score value may include:
performing a weighted sum over the at least two features, based on the weight information of each of them, to obtain the quality score value of the document to be processed.
In this embodiment of the present application, the quality score value of the document to be processed can be calculated according to formula (1).
$$S = \sum_{i=1}^{n} w_i f_i \tag{1}$$
Here, S denotes the quality score value of the document to be processed, f_i denotes the i-th feature, w_i denotes the weight of the i-th of the at least two features, and n denotes the number of features.
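The weighted sum of formula (1) can be sketched in Java as follows. The class QualityScore and the method score are hypothetical names, and the feature values are assumed to be numeric already.

```java
// Hypothetical sketch of formula (1): S is the sum over i of w_i * f_i.
final class QualityScore {
    static double score(double[] features, double[] weights) {
        if (features.length != weights.length) {
            throw new IllegalArgumentException("each feature needs exactly one weight");
        }
        double sum = 0.0;
        for (int i = 0; i < features.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }
}
```

For example, features 1.0 and 0.5 with weights 0.4 and 0.6 give 0.4 × 1.0 + 0.6 × 0.5 = 0.7.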
In some embodiments, whether or not a target feature is a default feature, the third-party platform may determine the weight of the target feature according to actual requirements, or according to the initial weight of the target feature sent by the electronic device.
In some embodiments, the electronic device may predetermine the initial weight of a target feature and send it to the third-party platform. The third-party platform may use the initial weight directly as the weight of the corresponding feature, or may modify the initial weight to obtain the weight of the corresponding feature.
The contents of two configuration files are illustrated below by way of example in Table 1 and Table 2.
Table 1
key | value | Explanation
PublicKey | !#abc$dce | 9 random characters
ClassLocation | /lib/mycalculator.jar | Location of the jar package
ClassName | myAlgorithm | Name of the custom class (encrypted)
featureName | [A,B,C,D] | Default feature extraction methods
Feature weights (featureWeight) | [0.1,0.2,0.1,0.1] | Initial weights
Non-default feature name (ExternalFeatureName) | [D,E,F] | Non-default features (encrypted)
Non-default feature weights (ExternalFeatureWeight) | [0.1,0.2,0.2] | Non-default feature weights
In Table 1, PublicKey denotes the public key, ClassLocation denotes the path of the jar package, ClassName denotes the class name, featureName denotes the names of the default features, featureWeight denotes the feature weights, ExternalFeatureName denotes the names of the non-default features, and ExternalFeatureWeight denotes the weights of the non-default features. A, B, C and D denote feature A, feature B, feature C and feature D, which are different default features. Their weights are initial weights determined by the electronic device and are 0.1, 0.2, 0.1 and 0.1 respectively. D, E and F denote feature D, feature E and feature F, which are all non-default features; in Table 1, their weights are 0.1, 0.2 and 0.2 respectively.
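The bracketed list values in the example configuration, such as [0.1,0.2,0.1,0.1] and [D,E,F], could be read as sketched below. The application does not specify the file format or a parser, so the class ConfigValues and its methods are hypothetical and only assume the comma-separated, bracketed form shown in the table.

```java
// Hypothetical parser for bracketed list values in the configuration file.
final class ConfigValues {
    // Splits "[a,b,c]" into its trimmed items; "[]" yields an empty array.
    static String[] parseNames(String value) {
        String body = value.trim();
        if (!body.startsWith("[") || !body.endsWith("]")) {
            throw new IllegalArgumentException("expected a bracketed list: " + value);
        }
        body = body.substring(1, body.length() - 1).trim();
        if (body.isEmpty()) {
            return new String[0];
        }
        String[] parts = body.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    // Parses "[0.1,0.2]" into numeric weights.
    static double[] parseWeights(String value) {
        String[] parts = parseNames(value);
        double[] weights = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            weights[i] = Double.parseDouble(parts[i]);
        }
        return weights;
    }
}
```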
Table 2
key | value | Explanation
ClassLocation | /lib/engCalculator.jar | Location of the jar package
ClassName | myAlgorithm | Name of the custom class (encrypted)
featureName | [A1] | Default feature extraction method
featureWeight | [0.4] | Initial weight
Non-default feature name (ExternalFeatureName) | [A2,A3,A4] | Non-default features (encrypted)
Non-default feature weights (ExternalFeatureWeight) | [0.2,0.2,0.2] | Non-default feature weights
In Table 2, ClassLocation, ClassName, featureName, ExternalFeatureName and ExternalFeatureWeight have the same meanings as in Table 1 and are not described again here. A1, A2, A3 and A4 denote feature A1, feature A2, feature A3 and feature A4. Feature A1 is a default feature whose weight, 0.4, is the initial weight determined by the electronic device. Feature A2, feature A3 and feature A4 are all non-default features; in Table 2, their weights are 0.2, 0.2 and 0.2 respectively.
An implementation of determining the initial weights of the default features is described below by way of example.
In this embodiment of the present application, when the default features include multiple features, multiple different candidate weight combinations may be predetermined for the default features. Each candidate weight combination includes one weight for each default feature, and the weights in each candidate weight combination sum to 1. One of these candidate weight combinations is selected as the initial weights of the default features.
In some embodiments, one weight combination may be selected from the candidate weight combinations as follows. A manual score value is obtained for a sample document acquired in advance. For each candidate weight combination, a weighted sum of the score values of the default features is computed to obtain the quality score value of the sample document. One candidate weight combination is then selected from those that satisfy a set condition, the set condition being that the absolute value of the difference between the manual score value and the quality score value of the sample document is less than a set value. In one implementation, among the candidate weight combinations that satisfy the set condition, the one that brings the quality score value closest to the manual score value of the sample document may be selected.
In some embodiments, the default features include feature A5 and feature A6. The weight of feature A5 is traversed from 0.1 to 0.9 in a preset step of 0.05 to determine multiple weights for feature A5. For each weight of feature A5, the weight of feature A6 is determined, which yields the candidate weight combinations; the weights in each candidate weight combination sum to 1.
Table 3
Weight of feature A5 | Weight of feature A6
0.1 | 0.9
0.15 | 0.85
0.20 | 0.80
... | ...
0.9 | 0.1
Table 3 shows the candidate weight combinations of feature A5 and feature A6; each row of Table 3 represents one candidate weight combination.
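The traversal that produces these candidate weight combinations can be sketched in Java as follows. The class CandidateWeights and its method are hypothetical; the sketch counts in integer percentages so that the 0.05 step does not accumulate floating-point drift.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator of candidate weight combinations for two features.
final class CandidateWeights {
    // The first weight runs from 0.10 to 0.90 in steps of 0.05; the second
    // weight is chosen so that each pair sums to 1.
    static List<double[]> twoFeatureCandidates() {
        List<double[]> candidates = new ArrayList<>();
        for (int percent = 10; percent <= 90; percent += 5) {
            candidates.add(new double[] { percent / 100.0, (100 - percent) / 100.0 });
        }
        return candidates;
    }
}
```

With these bounds the traversal yields 17 combinations, from (0.1, 0.9) to (0.9, 0.1).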
After the candidate weight combinations of feature A5 and feature A6 are obtained, the absolute value of the difference between the manual score value and the quality score value of the sample document can be determined for each candidate weight combination. For the case where feature A5 represents the document length and feature A6 represents the document word count, Table 4 shows the manual score value and the quality score value of the sample document for each candidate weight combination.
Table 4
[Table 4 is available only as images in the source: Figure PCTCN2021083679-appb-000002 and Figure PCTCN2021083679-appb-000003]
Based on the manual score values and quality score values shown in Table 4, one of the candidate weight combinations can be selected as the initial weights of the default features in the manner described above.
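The selection rule can be sketched in Java as follows: keep only the candidates whose quality score lies within the set value of the manual score, and among those take the closest one. The class WeightSelection, the method select, and the use of a single sample document are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical sketch of selecting the initial weights from the candidates.
final class WeightSelection {
    // Returns the index of the best candidate, or -1 when no candidate
    // satisfies the set condition |manualScore - qualityScore| < threshold.
    static int select(List<double[]> candidates,
                      double[] featureScores,
                      double manualScore,
                      double threshold) {
        int best = -1;
        double bestDiff = Double.MAX_VALUE;
        for (int c = 0; c < candidates.size(); c++) {
            double[] weights = candidates.get(c);
            double quality = 0.0;
            for (int i = 0; i < featureScores.length; i++) {
                quality += weights[i] * featureScores[i];
            }
            double diff = Math.abs(manualScore - quality);
            if (diff < threshold && diff < bestDiff) {
                best = c;
                bestDiff = diff;
            }
        }
        return best;
    }
}
```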
In other embodiments, when the target features include both default features and non-default features, the electronic device may also determine the initial weights of the default features and the non-default features together and send these initial weights to the third-party platform. The third-party platform may use the initial weights directly as the weights of the corresponding features, or may modify the initial weights to obtain the weights of the corresponding features.
In some embodiments, the default features include feature B1 and the non-default feature is feature B2. The weight of feature B1 is traversed from 0.1 to 0.9 in a preset step of 0.05 to determine multiple weights for feature B1. For each weight of feature B1, the weight of feature B2 is determined, which yields the candidate weight combinations. Each candidate weight combination includes the weight of feature B1 and the weight of feature B2, and these two weights sum to 1.
After the candidate weight combinations of feature B1 and feature B2 are obtained, the absolute value of the difference between the manual score value and the quality score value of the sample document can be determined for each candidate weight combination. For the case where the sample document is an English document, feature B1 represents the number of words, and feature B2 represents the average sentence length, Table 5 shows the manual score value and the quality score value of the sample document for each candidate weight combination.
Table 5
[Table 5 is available only as an image in the source: Figure PCTCN2021083679-appb-000004]
可以基于表5所示的人工评分值与质量评分值,按照前述记载的内容,在多个候选权重组合中选取出一个权重组合作为默认特征和非默认特征的初始权重。Based on the manual rating value and the quality rating value shown in Table 5, and according to the content described above, one weight combination can be selected from the multiple candidate weight combinations as the initial weight of the default feature and the non-default feature.
Two implementations for obtaining the quality score value of the document to be processed are described below by way of example.
First implementation
The document to be processed is a Chinese document, and its target features include a length-related feature, a template-related feature, and a part-of-speech-related feature. The length-related feature represents the character count of the document to be processed; the template-related feature represents the similarity between the document to be processed and a preset template; and the part-of-speech-related feature represents the ratio of the number of words with a preset part of speech to the number of all words in the document to be processed. For example, the preset parts of speech include verbs and nouns.
In the embodiments of the present application, multiple different character count intervals may be predetermined, with each interval corresponding to one value. In this way, the value of the length-related feature can be obtained by discretizing the character count.
In some embodiments, the value of the length-related feature can be determined according to Table 6.
Table 6
Character count                      Value of the length-related feature
Character count < 100                0
100 ≤ character count < 500          0.2
500 ≤ character count < 900          0.4
900 ≤ character count < 1300         0.6
1300 ≤ character count < 1700        0.8
1700 ≤ character count < 2000        1
Character count > 2000               1
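The discretization in Table 6 can be sketched as a lookup over interval boundaries. The function name `length_feature` is hypothetical. Table 6 does not list a row for a character count of exactly 2000, so the sketch assumes that every count of 1700 or more maps to 1.

```python
import bisect

# Lower bounds of the intervals in Table 6, and the value assigned to each interval.
_LOWER_BOUNDS = [100, 500, 900, 1300, 1700]
_VALUES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def length_feature(char_count):
    """Map a character count to the value of the length-related feature."""
    # bisect_right counts how many lower bounds are <= char_count,
    # which is exactly the index of the interval the count falls into.
    return _VALUES[bisect.bisect_right(_LOWER_BOUNDS, char_count)]


print(length_feature(99))    # 0.0
print(length_feature(100))   # 0.2
print(length_feature(2500))  # 1.0
```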
In the embodiments of the present application, Apache POI can be used to extract content attribute data from the document to be processed and from the preset template. The content attribute data may include at least one of the following: main title, subheading, body text, and summary. Subheadings may be divided by font size into level-1 headings, level-2 headings, level-3 headings, level-4 headings, level-5 headings, and so on. After the content attribute data is extracted, values can be assigned to it in a preset manner, thereby converting the content attribute data into a document feature vector.
In some embodiments, the content attribute data of the preset template is (title, level-1 heading, body text, summary), and the document feature vector of the preset template is [1, 1, 1, 1]. If the content attribute data of the document to be processed contains none of title, level-1 heading, body text, and summary, the document feature vector of the document to be processed is set to an all-zero vector. If the content attribute data of the document to be processed contains any of title, level-1 heading, body text, and summary, each part of the content attribute data of the document to be processed is checked against the content attribute data of the preset template: if the part belongs to the content attribute data of the preset template, the corresponding component of the document feature vector takes the value 1; otherwise, the corresponding component takes the value -1.
For ease of understanding, three examples are given below. In the first example, the document to be processed is document 1, whose content attribute data is (title, level-1 heading, body text, summary). Comparing the content attribute data of the preset template with that of document 1 gives the document feature vector [1, 1, 1, 1] for document 1. In the second example, the document to be processed is document 2, whose content attribute data is (title, level-3 heading, level-4 heading, level-5 heading, body text, summary). Comparing the content attribute data of the preset template with that of document 2 gives the document feature vector [1, -1, -1, -1, 1, 1] for document 2. In the third example, the document to be processed is document 3, whose content attribute data is (level-3 heading, level-4 heading, level-5 heading). The content attribute data of document 3 is completely different from that of the preset template and contains none of title, level-1 heading, body text, and summary, so the document feature vector of document 3 is [0, 0, 0].
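The three examples above can be reproduced with the following sketch. The function name `document_feature_vector` and the attribute strings are illustrative assumptions; extracting the content attribute data from an actual file (for example with Apache POI) is outside the scope of the sketch.

```python
def document_feature_vector(doc_attributes, template_attributes):
    """Build a document feature vector by comparing attributes with the template."""
    template = set(template_attributes)
    # No attribute in common with the template: all-zero vector.
    if not any(attr in template for attr in doc_attributes):
        return [0] * len(doc_attributes)
    # Otherwise 1 for attributes found in the template and -1 for the others.
    return [1 if attr in template else -1 for attr in doc_attributes]


TEMPLATE = ("title", "level-1 heading", "body text", "summary")

doc1 = ("title", "level-1 heading", "body text", "summary")
doc2 = ("title", "level-3 heading", "level-4 heading", "level-5 heading",
        "body text", "summary")
doc3 = ("level-3 heading", "level-4 heading", "level-5 heading")

print(document_feature_vector(doc1, TEMPLATE))  # [1, 1, 1, 1]
print(document_feature_vector(doc2, TEMPLATE))  # [1, -1, -1, -1, 1, 1]
print(document_feature_vector(doc3, TEMPLATE))  # [0, 0, 0]
```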
After the document feature vectors of the document to be processed and of the preset template are obtained, the similarity between the document to be processed and the preset template can be determined from these two vectors; that is, the value of the template-related feature can be determined.
In some embodiments, when the document feature vectors of the document to be processed and of the preset template have the same dimension, the similarity between the document to be processed and the preset template may be the cosine similarity, which is calculated by formula (2).
cos(θ) = (G·H) / (||G|| × ||H||)               (2)
Here, G and H denote the document feature vectors of the document to be processed and of the preset template, respectively; ||G|| denotes the length of vector G; ||H|| denotes the length of vector H; G·H denotes the dot product of vector G and vector H; and cos(θ) denotes the cosine similarity between the document to be processed and the preset template. It can be seen that cos(θ) is the value of the template-related feature.
It can be understood that the cosine similarity is the cosine of the angle between two vectors. A large cosine similarity indicates that vector G and vector H are similar; conversely, a small cosine similarity indicates that vector G and vector H differ considerably.
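The cosine similarity between two document feature vectors of the same dimension can be computed as in the following sketch. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumption of the sketch, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(g, h):
    """Cosine of the angle between two vectors of equal dimension."""
    if len(g) != len(h):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(g, h))
    norm_product = math.sqrt(sum(a * a for a in g)) * math.sqrt(sum(b * b for b in h))
    if norm_product == 0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / norm_product


print(cosine_similarity([1, 1, 1, 1], [1, 1, 1, 1]))   # 1.0
print(cosine_similarity([1, -1, 1, 1], [1, 1, 1, 1]))  # 0.5
```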
In some embodiments, when the document to be processed is document 1 above, it can be determined from formula (2) that the cosine similarity between the document to be processed and the preset template is 1; that is, the value of the template-related feature of the document to be processed is 1.
In the embodiments of the present application, the part-of-speech-related feature can be determined from the proportion of nouns and verbs among all words in the document to be processed. In some embodiments, the document to be processed contains 20 nouns and 10 verbs out of 50 words in total, so the value of the part-of-speech-related feature is 0.6.
In some embodiments, the character count of the document to be processed is greater than 2000, the document feature vector of the preset template is [1, 1, 1, 1], the document feature vector of the document to be processed is [1, 1, 1, 1], and nouns and verbs account for 0.6 of all words in the document to be processed. The values of the length-related feature, the template-related feature, and the part-of-speech-related feature of the document to be processed are then 1, 1, and 0.6, respectively. When the weights of the length-related feature, the template-related feature, and the part-of-speech-related feature are 0.2, 0.4, and 0.4, respectively, the quality score value of the document to be processed can be calculated according to formula (1); that is, the quality score value is 0.84. In some embodiments, the quality score value of the document to be processed can also be multiplied by 100 to obtain the quality score value on a 100-point scale; here, the quality score value of the document to be processed on a 100-point scale is 84.
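The arithmetic of this example can be checked with a short sketch, assuming that the quality score is the weighted sum of the feature values. The variable and function names are illustrative only.

```python
def quality_score(feature_values, weights):
    """Weighted sum of the feature values."""
    return sum(v * w for v, w in zip(feature_values, weights))


# Part-of-speech-related feature: (nouns + verbs) / total words.
nouns, verbs, total_words = 20, 10, 50
pos_feature = (nouns + verbs) / total_words  # 0.6

# Length-related, template-related and part-of-speech-related feature values.
features = (1.0, 1.0, pos_feature)
weights = (0.2, 0.4, 0.4)

score = quality_score(features, weights)
print(round(score, 2))        # 0.84
print(round(score * 100, 1))  # 84.0
```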
Second implementation
The document to be processed is an English document, and its target features include feature C1, feature C2, feature C3, and feature C4. Feature C1 is a default feature and represents the number of words in the document to be processed. Features C2, C3, and C4 are non-default features: feature C2 represents the average sentence length of the document to be processed, feature C3 represents the number of document errors in the document to be processed, and feature C4 represents the number of advanced words in the document to be processed. Here, document errors include, but are not limited to, spelling errors, punctuation errors, and a sentence whose first word does not begin with a capital letter. An advanced word is a word found in a predetermined advanced vocabulary; in practice, the user can determine the advanced vocabulary in advance according to the content of the document to be processed.
In some embodiments, multiple different word count intervals may be predetermined, with each interval corresponding to one value. In this way, the value of feature C1 can be obtained by discretizing the word count. For example, replacing the character count in Table 6 with the word count gives multiple word count intervals and the value corresponding to each interval.
In some embodiments, after the length of each sentence in the document to be processed is obtained, the sentence lengths can be averaged to obtain the average sentence length. To determine the value corresponding to the average sentence length, multiple sentence length intervals may be predetermined, with each interval corresponding to one value. In this way, the value of feature C2 can be obtained by discretizing the average sentence length.
In some embodiments, the value corresponding to the average sentence length can be obtained according to Table 7.
Table 7
Average sentence length                  Value of feature C2
Average sentence length < 5              0
5 ≤ average sentence length < 7          0.2
7 ≤ average sentence length < 9          0.4
9 ≤ average sentence length < 11         0.6
11 ≤ average sentence length < 13        0.8
13 ≤ average sentence length             1
In some embodiments, after the number of document errors is obtained, it can be used as the independent variable of an exponential function, and the value of the dependent variable of the exponential function is taken as the value of feature C3. Here, the base of the exponential function is greater than 0 and less than 1, so the more document errors there are, the smaller the value of feature C3.
Here, the exponential function may be the following formula (3):
Y = R^X               (3)
where X denotes the number of document errors, Y denotes the value of feature C3, and R ∈ (0, 1); for example, R takes the value 0.9.
In some embodiments, multiple advanced word count intervals may be predetermined, with each interval corresponding to one value. After the number of advanced words in the document to be processed is obtained, the value of feature C4 can be obtained by discretizing that number. In one example, when the number of advanced words is greater than or equal to 20, the value of feature C4 is 1.
In some embodiments, the document to be processed has 700 words, an average sentence length of 20, 2 document errors, 20 advanced words, and 40 sentences in total, and R takes the value 0.9. The values of features C1, C2, C3, and C4 of the document to be processed are then 0.4, 1, 0.81, and 1, respectively. When the weights of features C1, C2, C3, and C4 are 0.4, 0.2, 0.2, and 0.2, respectively, the quality score value of the document to be processed can be calculated according to formula (1); that is, the quality score value is 0.722. In some embodiments, the quality score value of the document to be processed can also be multiplied by 100 to obtain the quality score value on a 100-point scale; here, the quality score value of the document to be processed on a 100-point scale is 72.2.
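The second implementation can be sketched end to end as follows, again assuming a weighted-sum quality score. All function names are hypothetical. The word count intervals reuse the boundaries of Table 6, as suggested above, and counts of 1700 or more are assumed to map to 1. The text gives only one interval for feature C4 (20 or more advanced words), so the sketch handles only that case.

```python
import bisect

VALUES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
WORD_COUNT_BOUNDS = [100, 500, 900, 1300, 1700]  # Table 6 with word counts
SENTENCE_LENGTH_BOUNDS = [5, 7, 9, 11, 13]       # Table 7


def discretize(value, lower_bounds):
    """Map a number to the value of the interval it falls into."""
    return VALUES[bisect.bisect_right(lower_bounds, value)]


def error_feature(error_count, base=0.9):
    """Formula (3): Y = R^X with 0 < R < 1."""
    return base ** error_count


def advanced_word_feature(count):
    """Only the '20 or more' interval is specified in the text."""
    if count >= 20:
        return 1.0
    raise NotImplementedError("other intervals are not specified")


features = (
    discretize(700, WORD_COUNT_BOUNDS),      # C1 = 0.4
    discretize(20, SENTENCE_LENGTH_BOUNDS),  # C2 = 1.0
    error_feature(2),                        # C3 = 0.9 ** 2
    advanced_word_feature(20),               # C4 = 1.0
)
weights = (0.4, 0.2, 0.2, 0.2)

score = sum(v * w for v, w in zip(features, weights))
print(round(score, 3))        # 0.722
print(round(score * 100, 1))  # 72.2
```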
The embodiments of the present application can be applied to any document management scenario. When the document to be processed is a contingency plan document, the document processing method of the embodiments of the present application can be used as follows. First, communication between the electronic device and the third-party platform can be established based on the network communication structure shown in FIG. 1. Then, the third-party platform can send the configuration file and the file package to the electronic device, and the electronic device can extract the target features according to the configuration file and the file package, using techniques such as NLP. Finally, the quality of the contingency plan document can be evaluated and audited based on the extracted target features, which helps to further optimize the contingency plan document.
On the basis of the document processing method proposed in the foregoing embodiments, an embodiment of the present application further proposes a document processing apparatus. FIG. 5 is a schematic diagram of an optional structure of the document processing apparatus according to an embodiment of the present application. As shown in FIG. 5, the document processing apparatus 500 may include:
a first obtaining module 501, configured to obtain a document to be processed;
a receiving module 502, configured to receive a configuration file sent by a third-party platform, where the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform, and the file package includes first information representing a feature extraction method for the target feature;
a second obtaining module 503, configured to obtain the file package based on the path information of the file package when the identifier of the target feature is different from the identifier of a default feature;
a processing module 504, configured to extract the target feature from the document to be processed based on the first information in the file package.
In some embodiments of the present application, the file package includes a custom class, and the first information is located in the custom class;
the second obtaining module 503 is further configured to load the custom class in the file package through the reflection mechanism of a programming language, and to obtain the first information from the loaded custom class.
In some embodiments of the present application, the configuration file further includes second information, and the second information includes an identifier of the file package and/or an identifier of the custom class;
the second obtaining module 503 being configured to load the custom class in the file package through the reflection mechanism of the programming language includes:
when it is determined that the second information in the configuration file is information agreed in advance with the third-party platform, loading the custom class in the file package through the reflection mechanism of the programming language.
In some embodiments of the present application, the second obtaining module 503 is further configured to obtain a preset encryption method of the second information, and to decrypt the encrypted information in the configuration file using the decryption method corresponding to that encryption method, so as to obtain the second information, where the encrypted information is obtained by encrypting the second information using the encryption method.
In some embodiments of the present application, the second obtaining module 503 is further configured to predetermine an abstract class and to set the custom class to inherit the predetermined abstract class;
the second obtaining module 503 being configured to obtain the first information from the loaded custom class includes:
instantiating the custom class as an object and, when the object belongs to the abstract class, obtaining the first information from the loaded custom class.
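The loading steps above can be illustrated with the following sketch. The present application does not name a programming language; the sketch uses Python's `importlib` as a stand-in for a reflection mechanism. The names `FeatureExtractor`, `load_extractor`, `AGREED_IDENTIFIERS`, and the module `third_party_features` are all hypothetical, and the in-memory module merely simulates a file package supplied by a third-party platform.

```python
import abc
import importlib
import sys
import types


class FeatureExtractor(abc.ABC):
    """Predetermined abstract class that every custom class must inherit."""

    @abc.abstractmethod
    def extract(self, document):
        """Return the value of the target feature for the document."""


# Identifiers agreed in advance with the third-party platform (hypothetical).
AGREED_IDENTIFIERS = {("third_party_features", "WordCount")}


def load_extractor(package_id, class_id):
    """Load a custom class by name and check that it inherits the abstract class."""
    if (package_id, class_id) not in AGREED_IDENTIFIERS:
        raise PermissionError("identifiers were not agreed in advance")
    module = importlib.import_module(package_id)  # load the package by name
    custom_class = getattr(module, class_id)      # look up the class by name
    obj = custom_class()                          # instantiate the custom class
    if not isinstance(obj, FeatureExtractor):
        raise TypeError("custom class does not inherit the abstract class")
    return obj


# Stand-in for the third-party file package: a module registered in memory.
class WordCount(FeatureExtractor):
    def extract(self, document):
        return float(len(document.split()))


package = types.ModuleType("third_party_features")
package.WordCount = WordCount
sys.modules["third_party_features"] = package

extractor = load_extractor("third_party_features", "WordCount")
print(extractor.extract("a short sample document"))  # 4.0
```

Checking the identifiers before loading, and the base class after instantiation, keeps code that was not agreed with the third-party platform from being executed as a feature extractor.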
In some embodiments of the present application, the processing module 504 is further configured to, when the identifier of the target feature is the same as the identifier of the default feature, extract the target feature from the document to be processed based on a predetermined extraction method for the default feature.
In some embodiments of the present application, the processing module 504 is further configured to perform quality scoring on the document to be processed based on the target feature, so as to obtain a quality score value of the document to be processed.
In some embodiments of the present application, the target feature includes at least two features, and the configuration file includes weight information of each of the at least two features;
the processing module 504 being configured to perform quality scoring on the document to be processed based on the target feature, so as to obtain the quality score value of the document to be processed, includes:
performing a weighted sum operation on the at least two features based on the weight information of each of the at least two features, so as to obtain the quality score value of the document to be processed.
In some embodiments of the present application, the processing module 504 being configured to extract the target feature from the document to be processed includes:
discretizing the character count of the document to be processed according to multiple predetermined character count intervals to obtain a length-related feature, where each character count interval corresponds to one value; extracting the document feature vector of the document to be processed, and using the cosine similarity between the document feature vector of the document to be processed and the document feature vector of a preset template as a template-related feature; and determining a part-of-speech-related feature according to the proportion of words with a preset part of speech among all words in the document to be processed;
using at least two of the length-related feature, the template-related feature, and the part-of-speech-related feature as the target feature.
In some embodiments of the present application, the processing module 504 being configured to extract the target feature from the document to be processed includes:
discretizing the word count of the document to be processed according to multiple predetermined word count intervals to obtain a first feature, where each word count interval corresponds to one value; discretizing the average sentence length of the document to be processed according to multiple predetermined sentence length intervals to obtain a second feature, where each sentence length interval corresponds to one value; using the number of document errors in the document to be processed as the independent variable of an exponential function to obtain the value of the exponential function, and using that value as a third feature; and discretizing the number of advanced words in the document to be processed according to multiple predetermined advanced word count intervals to obtain a fourth feature, where each advanced word count interval corresponds to one value and an advanced word is a word found in a predetermined advanced vocabulary;
using at least two of the first feature, the second feature, the third feature, and the fourth feature as the target feature.
In practical applications, the first obtaining module 501, the receiving module 502, the second obtaining module 503, and the processing module 504 can all be implemented by a processor, and the processor may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller, and a microprocessor. It can be understood that other electronic components may also implement the functions of the processor, which is not limited in the embodiments of the present application.
It should be noted that the description of the above apparatus embodiments is similar to the description of the above method embodiments, and the apparatus embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present application, refer to the description of the method embodiments of the present application.
It should be noted that, in the embodiments of the present application, if the above document processing method is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a terminal, a server, or the like) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application further provides a computer program product. The computer program product includes computer-executable instructions, and the computer-executable instructions are used to implement any one of the document processing methods provided by the embodiments of the present application.
Correspondingly, an embodiment of the present application further provides a computer storage medium. Computer-executable instructions are stored on the computer storage medium, and the computer-executable instructions are used to implement any one of the document processing methods provided by the above embodiments.
An embodiment of the present application further provides an electronic device. FIG. 6 is a schematic diagram of an optional structure of the electronic device provided by an embodiment of the present application. As shown in FIG. 6, the electronic device 60 includes:
a memory 601, configured to store executable instructions;
a processor 602, configured to implement any one of the above document processing methods when executing the executable instructions stored in the memory 601.
The processor 602 may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller, and a microprocessor.
The above computer-readable storage medium or memory may be a memory such as a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM). It may also be any of various terminals that include one or any combination of the above memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
It should be pointed out here that the description of the above storage medium and device embodiments is similar to the description of the above method embodiments, and these embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the storage medium and device embodiments of the present application, refer to the description of the method embodiments of the present application.
It should be understood that references throughout the specification to "some embodiments" mean that a particular feature, structure, or characteristic related to the embodiments is included in at least one embodiment of this application. Therefore, occurrences of "in some embodiments" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of this application, the sequence numbers of the processes above do not imply an order of execution; the order in which the processes are executed should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of this application in any way. The serial numbers of the embodiments above are for description only and do not indicate that any embodiment is better or worse than another.
It should be noted that, in this document, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements that are not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element qualified by the phrase "including a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in an actual implementation there may be other ways of dividing them. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of this application.
In addition, the functional units in the embodiments of this application may all be integrated into one processing unit, each unit may serve as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
Alternatively, if the integrated unit described above is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence or in the part that contributes to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a device automatic test line to perform all or part of the methods described in the embodiments of this application. The storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The methods disclosed in the several method embodiments provided in this application may be combined arbitrarily, provided they do not conflict, to obtain new method embodiments.
The features disclosed in the several method or device embodiments provided in this application may be combined arbitrarily, provided they do not conflict, to obtain new method embodiments or device embodiments.
The above are merely implementations of this application, and the protection scope of this application is not limited to them. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Industrial Applicability
The embodiments of this application provide a document processing method, apparatus, device, and computer-readable storage medium. The method includes: acquiring a document to be processed; receiving a configuration file sent by a third-party platform, where the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform, and the file package includes first information representing a feature extraction method for the target feature; acquiring the file package based on its path information when the identifier of the target feature differs from the identifier of a default feature; and extracting the target feature from the document to be processed based on the first information in the file package. In the embodiments of this application, when a target feature of the document to be processed must be extracted and that feature is not a default feature, no new program code needs to be written and run locally to extract it. Instead, the extraction method for the target feature can be obtained directly from the third-party platform, which reduces time and labor costs to a certain extent.
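The flow summarized above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the applicant's implementation: the configuration keys (`feature_id`, `package_path`, `class_name`), the `FeatureExtractor` base class, and the `word_count` default feature are hypothetical names, a single `.py` file stands in for the "file package", and Python's `importlib` stands in for the reflection mechanism and abstract-class check described in claims 2 and 5.

```python
import importlib.util
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class FeatureExtractor(ABC):
    """Hypothetical base class that a third-party extractor must inherit."""

    @abstractmethod
    def extract(self, document: str) -> float:
        ...


def default_word_count(document: str) -> float:
    """Stand-in for a built-in (default) feature."""
    return float(len(document.split()))


# Identifiers of default features and their locally available extractors.
DEFAULT_FEATURES = {"word_count": default_word_count}


def load_extractor(package_path: str, class_name: str) -> FeatureExtractor:
    """Load a class by name from the file at package_path and instantiate it."""
    spec = importlib.util.spec_from_file_location("third_party_package", package_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load package from {package_path}")
    module = importlib.util.module_from_spec(spec)
    # Assumption: the base class is injected so the package can subclass it.
    module.FeatureExtractor = FeatureExtractor
    spec.loader.exec_module(module)
    instance = getattr(module, class_name)()
    # Only accept objects that belong to the agreed abstract class.
    if not isinstance(instance, FeatureExtractor):
        raise TypeError(f"{class_name} does not inherit FeatureExtractor")
    return instance


def extract_target_feature(document: str, config: dict) -> float:
    """Use the default extractor if the identifier matches, else load one."""
    feature_id = config["feature_id"]
    if feature_id in DEFAULT_FEATURES:
        return DEFAULT_FEATURES[feature_id](document)
    extractor = load_extractor(config["package_path"], config["class_name"])
    return extractor.extract(document)


# Source of a hypothetical third-party package containing one custom class.
PLUGIN_SOURCE = '''
class ExclamationCount(FeatureExtractor):
    def extract(self, document):
        return float(document.count("!"))
'''


def demo() -> float:
    """Write the hypothetical package to disk and extract a non-default feature."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "plugin.py"
        path.write_text(PLUGIN_SOURCE, encoding="utf-8")
        config = {
            "feature_id": "exclamation_count",
            "package_path": str(path),
            "class_name": "ExclamationCount",
        }
        return extract_target_feature("Hi! Bye!", config)
```

In this sketch the caller never needs local code for `exclamation_count`; only the configuration and the package supplied with it change. Executing code received from another party is a trust decision, which is presumably why claims 3 and 4 add a check of pre-agreed, encrypted identifiers before loading.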

Claims (23)

  1. A document processing method, comprising:
    acquiring a document to be processed;
    receiving a configuration file sent by a third-party platform, wherein the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform, and the file package includes first information representing a feature extraction method for the target feature;
    acquiring the file package based on the path information of the file package when the identifier of the target feature differs from an identifier of a default feature; and
    extracting the target feature from the document to be processed based on the first information in the file package.
  2. The document processing method according to claim 1, wherein the file package includes a custom class, and the first information is located in the custom class;
    the method further comprises: loading the custom class in the file package through a reflection mechanism of a programming language, and obtaining the first information from the loaded custom class.
  3. The document processing method according to claim 2, wherein the configuration file further includes second information, and the second information includes an identifier of the file package and/or an identifier of the custom class;
    the loading of the custom class in the file package through the reflection mechanism of the programming language comprises:
    loading the custom class in the file package through the reflection mechanism of the programming language when it is determined that the second information in the configuration file is information agreed on in advance with the third-party platform.
  4. The document processing method according to claim 3, wherein the method further comprises:
    obtaining a preset encryption method for the second information;
    decrypting encrypted information in the configuration file using the decryption method that corresponds to the encryption method of the second information, to obtain the second information, wherein the encrypted information is obtained by encrypting the second information with the encryption method.
  5. The document processing method according to claim 2, wherein the method further comprises:
    predetermining an abstract class, and setting the custom class to inherit the predetermined abstract class;
    the obtaining of the first information from the loaded custom class comprises:
    instantiating the custom class as an object, and obtaining the first information from the loaded custom class when the object belongs to the abstract class.
  6. The document processing method according to claim 1, wherein the method further comprises:
    extracting the target feature from the document to be processed based on a predetermined extraction method for the default feature when the identifier of the target feature is the same as the identifier of the default feature.
  7. The document processing method according to any one of claims 1 to 6, wherein the method further comprises:
    scoring the quality of the document to be processed based on the target feature, to obtain a quality score value of the document to be processed.
  8. The document processing method according to claim 7, wherein the target feature includes at least two features, and the configuration file includes weight information for each of the at least two features;
    the scoring of the quality of the document to be processed based on the target feature, to obtain the quality score value of the document to be processed, comprises:
    performing a weighted sum over the at least two features based on the weight information of each of the at least two features, to obtain the quality score value of the document to be processed.
  9. The document processing method according to claim 8, wherein the extracting of the target feature from the document to be processed comprises:
    discretizing the character count of the document to be processed according to a plurality of predetermined character-count intervals to obtain a length-related feature, each character-count interval corresponding to one value; extracting a document feature vector of the document to be processed, and using the cosine similarity between the document feature vector of the document to be processed and a document feature vector of a preset template as a template-related feature; and determining a part-of-speech-related feature according to the proportion of words of a preset part of speech among all words in the document to be processed;
    using at least two of the length-related feature, the template-related feature, and the part-of-speech-related feature as the target feature.
  10. The document processing method according to claim 8, wherein the extracting of the target feature from the document to be processed comprises:
    discretizing the word count of the document to be processed according to a plurality of predetermined word-count intervals to obtain a first feature, each word-count interval corresponding to one value; discretizing the average sentence length of the document to be processed according to a plurality of predetermined sentence-length intervals to obtain a second feature, each sentence-length interval corresponding to one value; using the number of errors in the document to be processed as the independent variable of an exponential function, obtaining the value of the exponential function, and using that value as a third feature; and discretizing the number of advanced words in the document to be processed according to a plurality of predetermined advanced-word-count intervals to obtain a fourth feature, each advanced-word-count interval corresponding to one value, an advanced word being a word found in a predetermined advanced vocabulary list;
    using at least two of the first feature, the second feature, the third feature, and the fourth feature as the target feature.
  11. A document processing apparatus, comprising:
    a first acquisition module, configured to acquire a document to be processed;
    a receiving module, configured to receive a configuration file sent by a third-party platform, wherein the configuration file includes an identifier of a target feature of the document to be processed and path information of a file package provided by the third-party platform, and the file package includes first information representing a feature extraction method for the target feature;
    a second acquisition module, configured to acquire the file package based on the path information of the file package when the identifier of the target feature differs from an identifier of a default feature; and
    a processing module, configured to extract the target feature from the document to be processed based on the first information in the file package.
  12. The apparatus according to claim 11, wherein the file package includes a custom class, and the first information is located in the custom class;
    the second acquisition module is further configured to load the custom class in the file package through a reflection mechanism of a programming language, and to obtain the first information from the loaded custom class.
  13. The apparatus according to claim 12, wherein the configuration file further includes second information, and the second information includes an identifier of the file package and/or an identifier of the custom class;
    the second acquisition module being configured to load the custom class in the file package through the reflection mechanism of the programming language comprises:
    loading the custom class in the file package through the reflection mechanism of the programming language when it is determined that the second information in the configuration file is information agreed on in advance with the third-party platform.
  14. The apparatus according to claim 13, wherein the second acquisition module is further configured to obtain a preset encryption method for the second information, and to decrypt encrypted information in the configuration file using the decryption method that corresponds to the encryption method of the second information, to obtain the second information, wherein the encrypted information is obtained by encrypting the second information with the encryption method.
  15. The apparatus according to claim 12, wherein the second acquisition module is further configured to predetermine an abstract class and to set the custom class to inherit the predetermined abstract class;
    the second acquisition module being configured to obtain the first information from the loaded custom class comprises:
    instantiating the custom class as an object, and obtaining the first information from the loaded custom class when the object belongs to the abstract class.
  16. The apparatus according to claim 11, wherein the processing module is further configured to extract the target feature from the document to be processed based on a predetermined extraction method for the default feature when the identifier of the target feature is the same as the identifier of the default feature.
  17. The apparatus according to any one of claims 11 to 16, wherein the processing module is further configured to score the quality of the document to be processed based on the target feature, to obtain a quality score value of the document to be processed.
  18. The apparatus according to claim 17, wherein the target feature includes at least two features, and the configuration file includes weight information for each of the at least two features;
    the processing module being configured to score the quality of the document to be processed based on the target feature, to obtain the quality score value of the document to be processed, comprises:
    performing a weighted sum over the at least two features based on the weight information of each of the at least two features, to obtain the quality score value of the document to be processed.
  19. The apparatus according to claim 18, wherein the processing module being configured to extract the target feature from the document to be processed comprises:
    discretizing the character count of the document to be processed according to a plurality of predetermined character-count intervals to obtain a length-related feature, each character-count interval corresponding to one value; extracting a document feature vector of the document to be processed, and using the cosine similarity between the document feature vector of the document to be processed and a document feature vector of a preset template as a template-related feature; and determining a part-of-speech-related feature according to the proportion of words of a preset part of speech among all words in the document to be processed;
    using at least two of the length-related feature, the template-related feature, and the part-of-speech-related feature as the target feature.
  20. The apparatus according to claim 18, wherein the processing module being configured to extract the target feature from the document to be processed comprises:
    discretizing the word count of the document to be processed according to a plurality of predetermined word-count intervals to obtain a first feature, each word-count interval corresponding to one value; discretizing the average sentence length of the document to be processed according to a plurality of predetermined sentence-length intervals to obtain a second feature, each sentence-length interval corresponding to one value; using the number of errors in the document to be processed as the independent variable of an exponential function, obtaining the value of the exponential function, and using that value as a third feature; and discretizing the number of advanced words in the document to be processed according to a plurality of predetermined advanced-word-count intervals to obtain a fourth feature, each advanced-word-count interval corresponding to one value, an advanced word being a word found in a predetermined advanced vocabulary list;
    using at least two of the first feature, the second feature, the third feature, and the fourth feature as the target feature.
  21. An electronic device, comprising:
    a memory, configured to store executable instructions;
    a processor, configured to implement the document processing method according to any one of claims 1 to 10 when executing the executable instructions stored in the memory.
  22. A computer-readable storage medium storing executable instructions that, when executed by a processor, implement the document processing method according to any one of claims 1 to 10.
  23. A computer program comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device performs the document processing method according to any one of claims 1 to 10.
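The feature computations of claim 9 and the weighted scoring of claim 8 can be illustrated with a short sketch. It is written in Python under stated assumptions and is not taken from the application: the interval boundaries, the per-interval values, the feature names, the weights, and the pre-tagged `(word, tag)` input format are all hypothetical, and the document feature vectors are assumed to be computed elsewhere.

```python
import math
from bisect import bisect_right


def length_feature(count, boundaries, values):
    """Discretize a count: interval i covers [boundaries[i-1], boundaries[i]).

    `values` needs one more entry than `boundaries`, one value per interval.
    """
    return values[bisect_right(boundaries, count)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pos_ratio(tagged_words, target_tags):
    """Share of words whose part-of-speech tag is in target_tags."""
    if not tagged_words:
        return 0.0
    hits = sum(1 for _, tag in tagged_words if tag in target_tags)
    return hits / len(tagged_words)


def quality_score(features, weights):
    """Weighted sum of feature values, with weights read from the configuration."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical example: three features combined with configured weights.
features = {
    "length": length_feature(120, boundaries=[50, 200], values=[0.2, 1.0, 0.6]),
    "template": cosine_similarity([1.0, 0.0], [1.0, 0.0]),
    "pos": pos_ratio(
        [("fix", "VERB"), ("bug", "NOUN"), ("the", "DET"), ("now", "ADV")],
        {"NOUN", "VERB"},
    ),
}
weights = {"length": 0.5, "template": 0.25, "pos": 0.25}
score = quality_score(features, weights)
```

Because the weights arrive with the configuration file, the relative importance of each feature can be changed without modifying the scoring code.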
PCT/CN2021/083679 2020-08-28 2021-03-29 Document processing method and apparatus, electronic device, storage medium, and program WO2022041714A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010884957.2A CN112099870B (en) 2020-08-28 2020-08-28 Document processing method, device, electronic equipment and computer readable storage medium
CN202010884957.2 2020-08-28

Publications (1)

Publication Number Publication Date
WO2022041714A1 true WO2022041714A1 (en) 2022-03-03

Family

ID=73758247

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083679 WO2022041714A1 (en) 2020-08-28 2021-03-29 Document processing method and apparatus, electronic device, storage medium, and program

Country Status (2)

Country Link
CN (1) CN112099870B (en)
WO (1) WO2022041714A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112099870B (en) * 2020-08-28 2023-12-26 深圳前海微众银行股份有限公司 Document processing method, device, electronic equipment and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105589918A (en) * 2015-09-17 2016-05-18 广州市动景计算机科技有限公司 Method and device for extracting page information
CN111178057A (en) * 2020-01-02 2020-05-19 大汉软件股份有限公司 Content analysis and extraction system of government affair electronic document
US20200184210A1 (en) * 2018-12-06 2020-06-11 International Business Machines Corporation Multi-modal document feature extraction
CN112099870A (en) * 2020-08-28 2020-12-18 深圳前海微众银行股份有限公司 Document processing method and device, electronic equipment and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8533208B2 (en) * 2009-09-28 2013-09-10 Ebay Inc. System and method for topic extraction and opinion mining
CN108920656A (en) * 2018-07-03 2018-11-30 龙马智芯(珠海横琴)科技有限公司 Document properties description content extracting method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662270A (en) * 2022-09-09 2023-08-29 荣耀终端有限公司 File analysis method and related device
CN116662270B (en) * 2022-09-09 2024-05-10 荣耀终端有限公司 File analysis method and related device

Also Published As

Publication number Publication date
CN112099870A (en) 2020-12-18
CN112099870B (en) 2023-12-26


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21859563

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.06.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21859563

Country of ref document: EP

Kind code of ref document: A1