CN114120341A

CN114120341A - Resume document identification model training method, resume document identification method and device

Info

Publication number: CN114120341A
Application number: CN202111426081.8A
Authority: CN
Inventors: 王得贤; 李长亮; 毛璐
Original assignee: Beijing Kingsoft Digital Entertainment Co Ltd
Current assignee: Beijing Kingsoft Digital Entertainment Co Ltd
Priority date: 2021-11-26
Filing date: 2021-11-26
Publication date: 2022-03-01

Abstract

The application provides a resume document identification model training method, a resume document identification method and a resume document identification device. The resume document identification model training method comprises: obtaining a sample set, wherein the sample set comprises a plurality of sample documents; extracting any sample document, segmenting the sample document into words, and obtaining word segmentation features of the sample document based on the word segmentation result; and training a preset model based on the word segmentation features of each sample document to obtain a resume document identification model. Segmenting the sample documents and extracting their word segmentation features captures the sample information effectively and avoids information loss, and training the preset model on the word segmentation features of each sample document yields a resume document identification model that improves the efficiency and accuracy of resume document identification.

Description

Resume document identification model training method, resume document identification method and device

Technical Field

The application relates to the technical field of computers, in particular to a resume document recognition model training method. The application also relates to a resume document identification model training device, a resume document identification method, a resume document identification device, a computing device and a computer readable storage medium.

Background

With the development of internet technology, the identification of resume documents in daily office work increasingly relies on the internet. Resume document identification is the process of intelligently analyzing a document to judge whether it is a resume document, and it can effectively improve office efficiency.

Currently, rule-based methods are commonly employed to identify resume documents: a database containing keywords for categories such as study information, work information and award information is preset manually; all keywords of the document to be identified are extracted; and the document is identified as a resume document if, for each category in the database, at least one keyword of that category appears among the extracted keywords.

However, the rule-based method requires a large number of manually set rules, and manual setting inevitably introduces uncertainty, so the accuracy of the resume document identification result is poor. Moreover, matching keywords one by one makes resume document identification inefficient.
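For illustration only, the rule-based check described above can be sketched as follows. The category names and keywords in `KEYWORD_DB` and the function name `is_resume_by_rules` are hypothetical and are not taken from the application; the sketch only shows the "at least one keyword per category" rule.

```python
# Hypothetical keyword database: category -> keywords (illustrative values only).
KEYWORD_DB = {
    "study": {"education", "university", "degree"},
    "work": {"employer", "position", "internship"},
    "awards": {"prize", "scholarship", "award"},
}


def is_resume_by_rules(document_keywords, keyword_db=KEYWORD_DB):
    """Return True only if every category has at least one keyword in the document."""
    found = set(document_keywords)
    # A non-empty intersection means the category is represented in the document.
    return all(found & category_words for category_words in keyword_db.values())


covers_all_categories = is_resume_by_rules(["university", "internship", "scholarship"])
misses_two_categories = is_resume_by_rules(["university", "weather", "recipe"])
print(covers_all_categories, misses_two_categories)
```

Every category and keyword in such a database has to be chosen and maintained by hand, which is the drawback noted above.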

Disclosure of Invention

In view of this, embodiments of the present application provide a resume document identification model training method and a resume document identification method to solve technical defects in the prior art. The embodiment of the application also provides a resume document identification model training device, a resume document identification device, computing equipment and a computer readable storage medium.

According to a first aspect of the embodiments of the present application, there is provided a resume document recognition model training method, including:

obtaining a sample set, wherein the sample set comprises a plurality of sample documents, and the sample documents comprise resume sample documents;

extracting a first sample document in the sample set, segmenting words of the first sample document, and obtaining segmentation characteristics of the first sample document based on a segmentation result, wherein the first sample document is any sample document in the sample set;

and training a preset model based on the word segmentation characteristics of each sample document in the sample set to obtain a resume document identification model.

Optionally, the sample documents further include non-resume sample documents, where a resume sample document carries a positive sample tag indicating that it is a resume document, and a non-resume sample document carries a negative sample tag indicating that it is a non-resume document;

the step of training a preset model based on the word segmentation characteristics of each sample document in the sample set to obtain a resume document identification model comprises:

extracting word segmentation characteristics of the first sample document, and inputting the word segmentation characteristics of the first sample document into a preset model to obtain an identification result of whether the first sample document is a resume document;

calculating a loss value according to the identification result of the first sample document and the label carried by the first sample document;

if the loss value is larger than the preset threshold value, adjusting model parameters of a preset model, returning to execute the step of extracting the word segmentation characteristics of the first sample document, inputting the word segmentation characteristics of the first sample document into the preset model, and obtaining the identification result of whether the first sample document is a resume document;

and if the loss value is less than or equal to the preset threshold value, stopping training and determining that the current preset model is the resume document identification model.

Optionally, the step of calculating a loss value according to the recognition result of the first sample document and the tag carried by the first sample document includes:

and calculating, by using a cross entropy loss function, the cross entropy between the identification result of the first sample document and the label carried by the first sample document as the loss value.

Optionally, the preset model is a logistic regression model.
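The training procedure described above (predict, compute a cross entropy loss, adjust parameters while the loss exceeds a preset threshold, otherwise stop) can be sketched as follows. This is a minimal sketch under stated assumptions, not the application's implementation: the loss is averaged over the whole sample set, parameters are adjusted by plain gradient descent, and the names `train`, `threshold`, `learning_rate` and `max_epochs`, the toy feature values, and the `max_epochs` safety limit are all introduced here for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cross_entropy(p, y, eps=1e-12):
    """Cross entropy between predicted probability p and label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def train(features, labels, threshold=0.1, learning_rate=1.0, max_epochs=10000):
    """Train logistic regression until the mean loss is <= threshold (assumed stop rule)."""
    n = len(labels)
    weights = [0.0] * len(features[0])
    bias = 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        probs = [sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
                 for row in features]
        loss = sum(cross_entropy(p, y) for p, y in zip(probs, labels)) / n
        if loss <= threshold:
            break  # loss at or below the preset threshold: stop training
        # Loss above the threshold: adjust the model parameters and try again.
        errors = [p - y for p, y in zip(probs, labels)]
        for j in range(len(weights)):
            grad = sum(e * row[j] for e, row in zip(errors, features)) / n
            weights[j] -= learning_rate * grad
        bias -= learning_rate * sum(errors) / n
    return weights, bias, loss


# Toy word features (hypothetical values); label 1 = resume, 0 = non-resume.
FEATURES = [[0.4, 0.0], [0.3, 0.1], [0.0, 0.5], [0.1, 0.4]]
LABELS = [1, 1, 0, 0]
weights, bias, final_loss = train(FEATURES, LABELS)
print(weights, bias, final_loss)
```

The `max_epochs` limit is a practical safeguard added in this sketch so that the loop terminates even if the threshold is never reached.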

Optionally, the step of tokenizing the first sample document comprises:

and calling a preset word segmentation component, and segmenting the first sample document into words by using the preset word segmentation component to obtain each word in the first sample document.

Optionally, the word segmentation features include features of words in the first sample document;

the step of obtaining the word segmentation features of the first sample document based on the word segmentation result comprises:

counting the number of occurrences of each word of the word segmentation result in the first sample document, the total number of words in the first sample document, the total number of sample documents in the sample set, and, for any word, the number of sample documents in the sample set that contain the word;

calculating the word frequency feature of each word according to the number of occurrences of the word in the first sample document and the total number of words;

for any word, calculating the inverse document frequency feature of the word according to the total number of sample documents in the sample set and the number of sample documents in the sample set that contain the word;

and, for any word, determining the feature of the word according to the word frequency feature and the inverse document frequency feature of the word.
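The counting and feature steps above can be sketched as follows. Two points are assumptions of this sketch rather than statements of the application: the word frequency feature is taken as occurrences divided by total words, and the word feature is taken as the product of the word frequency feature and the inverse document frequency feature (the text only says the feature is determined "according to" both). The function name `tfidf_features` is hypothetical.

```python
import math
from collections import Counter


def tfidf_features(tokenized_docs):
    """Map each document (a list of words) to a {word: feature} dictionary."""
    total_docs = len(tokenized_docs)
    # Number of documents in the sample set that contain each word.
    doc_freq = Counter()
    for words in tokenized_docs:
        doc_freq.update(set(words))

    features = []
    for words in tokenized_docs:
        counts = Counter(words)      # occurrences of each word in this document
        total_words = len(words)     # total number of words in this document
        features.append({
            word: (count / total_words) * math.log10(total_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return features


sample_set = [["education", "university", "education"], ["university", "abstract"]]
sample_features = tfidf_features(sample_set)
print(sample_features)
```

In this toy sample set, "university" occurs in every document, so its inverse document frequency, and therefore its feature, is zero.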

According to a second aspect of the embodiments of the present application, there is provided a resume document identification method, including:

acquiring a target document to be identified;

and inputting the target document into the resume document identification model obtained by training by using the method provided by the first aspect of the embodiment of the application, and obtaining an identification result of whether the target document is the resume document.
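As a rough illustration of the identification step, the sketch below scores the word features of a target document with an already trained logistic regression model. The vocabulary, weights, bias, the 0.5 decision threshold and the function name `identify` are all hypothetical values chosen for this example; the application does not specify them.

```python
import math


def identify(feature_by_word, vocabulary, weights, bias, decision_threshold=0.5):
    """Return (is_resume, probability) for one target document."""
    # Arrange the word features in the fixed vocabulary order used for training.
    x = [feature_by_word.get(word, 0.0) for word in vocabulary]
    z = sum(w * v for w, v in zip(weights, x)) + bias
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= decision_threshold, probability


# Hypothetical trained parameters, for illustration only.
VOCABULARY = ["education experience", "work experience", "abstract"]
WEIGHTS = [3.0, 2.5, -4.0]
BIAS = -0.5

resume_like = identify({"education experience": 0.4, "work experience": 0.3},
                       VOCABULARY, WEIGHTS, BIAS)
paper_like = identify({"abstract": 0.5}, VOCABULARY, WEIGHTS, BIAS)
print(resume_like, paper_like)
```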

According to a third aspect of the embodiments of the present application, there is provided a resume document recognition model training device, including:

a first acquisition module configured to acquire a sample set, wherein the sample set comprises a plurality of sample documents, and the sample documents comprise resume sample documents;

the word segmentation module is configured to extract a first sample document in the sample set, segment words of the first sample document, and obtain word segmentation characteristics of the first sample document based on a word segmentation result, wherein the first sample document is any sample document in the sample set;

and the training module is configured to train the preset model based on the word segmentation characteristics of all sample documents in the sample set to obtain the resume document identification model.

The training module is further configured to: extract the word segmentation characteristics of the first sample document, and input the word segmentation characteristics of the first sample document into a preset model to obtain a recognition result of whether the first sample document is a resume document; calculate a loss value according to the recognition result of the first sample document and the label carried by the first sample document; if the loss value is greater than the preset threshold value, adjust the model parameters of the preset model and return to the step of extracting the word segmentation characteristics of the first sample document and inputting them into the preset model to obtain the recognition result of whether the first sample document is a resume document; and if the loss value is less than or equal to the preset threshold value, stop training and determine the current preset model to be the resume document identification model.

Optionally, the training module is further configured to calculate, as the loss value, cross entropy between the recognition result of the first sample document and the tag carried by the first sample document by using a cross entropy loss function according to the recognition result of the first sample document and the tag carried by the first sample document.

Optionally, the preset model is a logistic regression model.

Optionally, the word segmentation module is further configured to invoke a preset word segmentation component, and segment the first sample document into words by using the preset word segmentation component to obtain each word in the first sample document.

The word segmentation module is further configured to: count the number of occurrences of each word of the word segmentation result in the first sample document, the total number of words in the first sample document, the total number of sample documents in the sample set, and, for any word, the number of sample documents in the sample set that contain the word; calculate the word frequency feature of each word according to the number of occurrences of the word in the first sample document and the total number of words; for any word, calculate the inverse document frequency feature of the word according to the total number of sample documents in the sample set and the number of sample documents in the sample set that contain the word; and, for any word, determine the feature of the word according to the word frequency feature and the inverse document frequency feature of the word.

According to a fourth aspect of embodiments of the present application, there is provided a resume document identification apparatus, including:

the second acquisition module is configured to acquire a target document to be identified;

the target recognition module is configured to input the target document into the resume document recognition model obtained by training with the method provided by the first aspect of the embodiment of the present application, and obtain a recognition result of whether the target document is a resume document.

According to a fifth aspect of embodiments herein, there is provided a computing device comprising:

a memory and a processor;

the memory is used for storing computer-executable instructions, and the processor implements the steps of the method provided by the first aspect or the second aspect of the embodiments of the present application when executing the computer-executable instructions.

According to a sixth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the method provided by the first or second aspect of embodiments of the present application.

According to the resume document identification model training method provided by the application, a sample set comprising a plurality of sample documents is obtained; any sample document is extracted and segmented into words, and the word segmentation features of the sample document are obtained based on the word segmentation result; and the preset model is trained based on the word segmentation features of each sample document to obtain a resume document identification model. Segmenting the sample documents and extracting their word segmentation features captures the sample information effectively and avoids information loss, and training the preset model on the word segmentation features of each sample document yields a resume document identification model that improves the efficiency and accuracy of resume document identification.

Drawings

FIG. 1 is a schematic structural diagram of a resume document identification system according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for training a resume document recognition model according to an embodiment of the present application;

FIG. 3 is a flowchart of another resume document identification model training method provided in an embodiment of the present application;

FIG. 4 is a flowchart of another resume document identification model training method according to an embodiment of the present application;

FIG. 5 is a flowchart of another resume document identification model training method according to an embodiment of the present application;

FIG. 6 is a flowchart of a resume document identification method according to an embodiment of the present application;

FIG. 7 is a flowchart of another resume document identification method provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a resume document recognition model training apparatus according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a resume document identification apparatus according to an embodiment of the present application;

FIG. 10 is a block diagram of a computing device according to an embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the present application; the present application is therefore not limited to the specific implementations disclosed below.

The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.

First, the terms involved in one or more embodiments of the present application are explained.

Word frequency (TF, term frequency): refers to the number of times a given term appears in the document.

Inverse document frequency (IDF): a measure of the general importance of a word. The inverse document frequency of a particular term may be obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the base-10 logarithm of the resulting quotient.
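The definition above can be written compactly as follows. The symbols N (total number of documents) and n_t (number of documents containing the term t) are notation introduced here for illustration and do not appear in the original text.

```latex
\mathrm{IDF}(t) = \log_{10}\!\left(\frac{N}{n_t}\right)
```

For instance, a term that appears in 10 of 1000 documents has an IDF of log10(1000/10) = 2, while a term that appears in every document has an IDF of 0.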

Stop words: in information retrieval, in order to save storage space and improve search efficiency, certain characters or words are automatically filtered out before or after natural language data (or text) is processed; these characters or words are called stop words.
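A minimal sketch of stop-word filtering is shown below. The stop-word list and the function name `remove_stop_words` are hypothetical; real stop-word lists are much longer and depend on the language of the documents.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "in"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop every word whose lower-cased form is in the stop-word list."""
    return [word for word in words if word.lower() not in stop_words]


filtered = remove_stop_words(["The", "winner", "of", "a", "national", "prize"])
print(filtered)  # ['winner', 'national', 'prize']
```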

The present application provides a method for training a resume document recognition model, and at the same time, provides a method for recognizing a resume document, a device for training a resume document recognition model, a device for recognizing a resume document, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.

Fig. 1 shows a schematic structural diagram of a resume document identification system according to an embodiment of the present application.

The execution subjects comprise a server and an application end. The server is used for training the model. The application end is used for providing samples to the server, receiving the model trained by the server, and inputting a target document to be identified into the model to obtain an identification result. The application end comprises a first application end and a second application end, which may be different application programs in one terminal or application programs in different terminals.

In one possible implementation, the first application end and the second application end are application programs in different terminals. The server obtains the sample set required for training the model from the first application end, extracts any sample document from the sample set, segments the sample document into words, obtains the word segmentation features of the sample document based on the word segmentation result, and trains a preset model based on the word segmentation features of each sample document to obtain a resume document identification model. After obtaining the resume document identification model, the server sends it to the second application end. The second application end receives the resume document identification model, obtains a target document to be identified, and inputs the target document into the resume document identification model to obtain an identification result of whether the target document is a resume document.

In another possible implementation, the first application end and the second application end are different application programs in one terminal. The server obtains the sample set required for training the model from the first application end, extracts any sample document from the sample set, segments the sample document into words, obtains the word segmentation features of the sample document based on the word segmentation result, trains a preset model based on the word segmentation features of each sample document to obtain a resume document identification model, and stores the resume document identification model locally after obtaining it. The resume document identification model is then called from the first application end, a target document to be identified is obtained, and the target document is input into the resume document identification model to obtain an identification result of whether the target document is a resume document.

By applying the scheme of the embodiment of the application, a target document to be identified is obtained and input into a resume document identification model to obtain an identification result of whether the target document is a resume document. The resume document identification model is obtained as follows: a sample set is obtained, wherein the sample set comprises a plurality of resume sample documents and may also comprise a plurality of non-resume sample documents; any sample document is extracted and segmented into words, and the word segmentation features of the sample document are obtained based on the word segmentation result; and the preset model is trained based on the word segmentation features of each sample document. Segmenting the sample documents and extracting their word segmentation features captures the sample information effectively and avoids information loss, and training the preset model on these features yields a resume document identification model that improves the efficiency and accuracy of resume document identification.

Fig. 2 is a flowchart illustrating a method for training a resume document recognition model according to an embodiment of the present application, where the method specifically includes the following steps:

step S202: a sample set is obtained, wherein the sample set comprises a plurality of sample documents, and the sample documents comprise resume sample documents.

In the embodiment of the present application, training a resume document identification model requires a sample set containing a large number of sample documents. Generally, the sample set may be formed from a large number of manually input sample documents, or from a large number of sample documents read from other data acquisition devices or databases. The manner of obtaining the sample set is selected according to the actual situation, which is not limited in this embodiment of the present application.

The sample documents in the sample set include resume sample documents. A resume is a brief written description of a person's history, experience, specialties, hobbies and other related circumstances. The sample documents in the sample set may also include non-resume sample documents, such as papers and reports, which may or may not be related to resumes. The sample documents in the obtained sample set are generally labeled manually, that is, they carry tags: a resume sample document carries a positive sample tag indicating that it is a resume document, and a non-resume sample document carries a negative sample tag indicating that it is a non-resume document. For example, a resume sample document carries the tag "1", which marks it as a positive sample, and a non-resume sample document carries the tag "0", which marks it as a negative sample.

It should be noted that if the preset model is trained only on resume sample documents, the training process develops only the model's ability to recognize resume documents and does not improve its ability to recognize non-resume documents. Non-resume sample documents are therefore added to the sample set so that the samples are more balanced. When the preset model is subsequently trained with the sample documents in the sample set, both resume sample documents and non-resume sample documents are taken into account, which improves the generalization ability of the trained resume document identification model and the accuracy of model identification.

Step S204: extracting a first sample document in the sample set, segmenting words of the first sample document, and obtaining segmentation characteristics of the first sample document based on a segmentation result, wherein the first sample document is any sample document in the sample set.

After the sample set is obtained, the sample documents in the sample set may be processed. First, a first sample document is extracted from the sample set, the first sample document being any sample document in the sample set. Generally, because the first sample document is long, its text contains a large number of stop words, which occur frequently but carry little practical meaning and can reduce the accuracy of the trained resume document identification model. The first sample document is therefore segmented into words. Word segmentation means splitting the document content according to certain rules; for example, segmenting the content "I like to eat apples" yields "I", "like", "eat" and "apples". Finally, the word segmentation features of the first sample document are obtained based on the word segmentation result. The word segmentation features of the first sample document are the set of the features of each of its words. The feature of each word may be determined from the word frequency feature and the inverse document frequency feature of the word, or may be obtained by other methods such as mutual information or information gain, which is not limited here.

In the embodiment of the present application, to obtain the word segmentation result of the first sample document, the first sample document may be segmented in either of the following manners. In the first manner, a preset word segmentation component is called to perform word segmentation on the first sample document. The preset word segmentation component includes, but is not limited to, the Simple Chinese Word Segmentation system (SCWS), a final word segmentation tool, and the like, and is selected according to the actual situation, which is not limited in this embodiment of the present application. For example, part of the content of the first sample document is "education experience: studied at Qinghua University from 2016 to 2020; reward condition: won the first-class prize of the national college student mathematical modeling competition in 2017"; segmenting this content by calling the word segmentation tool yields the word segmentation results "education experience", "Qinghua University", "reward condition" and "first-class prize".

In the second manner, word segmentation rules are preset, and word segmentation is performed on the first sample document according to the preset rules, so that the first sample document can be segmented without using a preset word segmentation component. For example, a lexicon is preset, and part of the content of the first sample document is "education experience: studied at Qinghua University from 2016 to 2020; reward condition: won the first-class prize of the national college student mathematical modeling competition in 2017". This content is matched against the preset lexicon, which contains "education experience", "Qinghua University", "reward condition" and "first-class prize", so the word segmentation results of the first sample document are "education experience", "Qinghua University", "reward condition" and "first-class prize".
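The lexicon-matching manner can be sketched as follows. The text only states that the content is matched against a preset lexicon; the greedy longest-match scan used here (often called forward maximum matching) is one common way to do that and is an assumption of this sketch, as is the function name `segment_with_lexicon`.

```python
def segment_with_lexicon(text, lexicon):
    """Scan left to right, keeping the longest lexicon entry found at each position."""
    max_len = max(map(len, lexicon), default=0)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                words.append(candidate)
                i += size
                break
        else:
            i += 1  # no lexicon entry starts here; skip one character
    return words


LEXICON = {"education experience", "Qinghua University",
           "reward condition", "first-class prize"}
CONTENT = ("education experience: studied at Qinghua University from 2016 to 2020; "
           "reward condition: won the first-class prize of the national college "
           "student mathematical modeling competition in 2017")
segmented = segment_with_lexicon(CONTENT, LEXICON)
print(segmented)
```

Text that matches no lexicon entry is simply skipped, so the result contains only the four lexicon words, in the order they occur in the content.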

Step S206: and training a preset model based on the word segmentation characteristics of each sample document in the sample set to obtain a resume document identification model.

In the embodiment of the application, after the first sample document is segmented into words and its word segmentation features are obtained based on the word segmentation result, the preset model can be trained based on the word segmentation features of each sample document in the sample set to obtain the resume document identification model. Because a logistic regression (LR) model trains quickly, is simple and easy to understand, and occupies little memory, the preset model in this embodiment of the present application is a logistic regression model.

In one possible implementation of the application, the obtained first sample document is a non-resume sample document carrying the label "0", which marks it as a negative sample. The word segmentation features of the first sample document are extracted and input into the preset model, and the identification result obtained for the first sample document is "1", indicating that the model identifies it as a resume sample document. Because this result does not match the label, the model parameters of the preset model are adjusted and the procedure returns to the step of inputting the word segmentation features into the preset model. This is repeated until the preset model can accurately identify the type of the input sample document, at which point training is stopped and the current preset model is determined to be the resume document identification model.

In another possible implementation of the application, the obtained first sample document is a resume sample document carrying the label "1", which marks it as a positive sample. The word segmentation features of the first sample document are extracted and input into the preset model, and the identification result obtained for the first sample document is "0", indicating that the model identifies it as a non-resume sample document. Because this result does not match the label, the model parameters of the preset model are adjusted and the procedure returns to the step of inputting the word segmentation features into the preset model. This is repeated until the preset model can accurately identify the type of the input sample document, at which point training is stopped and the current preset model is determined to be the resume document identification model.

By applying the scheme of the embodiment of the application, a sample set is obtained, wherein the sample set comprises a plurality of resume sample documents and may also comprise a plurality of non-resume sample documents; any sample document is extracted and segmented into words, and the word segmentation features of the sample document are obtained based on the word segmentation result; and the preset model is trained based on the word segmentation features of each sample document to obtain a resume document identification model. Segmenting the sample documents and extracting their word segmentation features captures the sample information effectively and avoids information loss, and training the preset model on these features yields a resume document identification model that improves the efficiency and accuracy of resume document identification.

Based on the embodiment shown in fig. 2, step S204 may be specifically implemented by the flowchart shown in fig. 3, and fig. 3 shows a flowchart of another resume document recognition model training method provided in the embodiment of the present application, where the method specifically includes the following steps:

Step S2042: Call a preset word segmentation component and use it to segment the first sample document, obtaining the words in the first sample document.

In this application, the first sample document is segmented by calling a preset word segmentation component, which yields the words in the first sample document. The preset word segmentation component includes, but is not limited to, the Simple Chinese Word Segmentation system (SCWS), a final word segmentation tool, and the like, and is selected according to actual conditions. For example, segmenting a first sample document may yield words such as "Qinghua University", "computer specialty", "Qinghua University", "achievement excellence", "computer specialty", "national scholarship", "computer specialty" and "work".

Calling the preset word segmentation component to segment the first sample document and obtain its words improves the accuracy of word segmentation for the first sample document, which in turn makes the resulting resume document identification model more accurate.

Step S2044: Count the number of occurrences of each word of the word segmentation result in the first sample document, the total number of words in the first sample document, the total number of sample documents in the sample set, and, for any word, the number of sample documents in the sample set that contain the word.

In the embodiment of the application, in order to effectively obtain the sample information and avoid information loss, after the words in the first sample document are obtained, the following are counted: the number of occurrences of each word in the first sample document, the total number of words in the first sample document, the total number of sample documents in the sample set, and, for any word, the number of sample documents in the sample set that contain the word. Continuing the example of step S2042, in the first sample document "graduation" occurs 1 time, "Qinghua University" occurs 2 times, "computer specialty" occurs 4 times, "achievement excellence" occurs 1 time, "national scholarship" occurs 1 time, and "work" occurs 1 time, so the total number of words in the first sample document is 10. The total number of sample documents in the sample set is 1000. Of these, 500 contain "graduation", 10 contain "Qinghua University", 100 contain "computer specialty", 500 contain "achievement excellence", 200 contain "national scholarship", and 800 contain "work".

Step S2046: Calculate the word frequency feature of each word according to the number of occurrences of each word in the first sample document and the total number of words.

In the embodiment of the application, in order to effectively obtain sample information and avoid information loss, after the counts of step S2044 are obtained, the word frequency feature of each word is calculated according to the number of occurrences of the word in the first sample document and the total number of words. The word frequency feature of a given word is the number of times the word appears in the first sample document divided by the total number of words in the first sample document, that is, by the sum of the occurrences of all words in that document. Referring to the example of step S2044, the word frequency feature of "graduation" is 0.1, that of "Qinghua University" is 0.2, that of "computer specialty" is 0.4, that of "achievement excellence" is 0.1, that of "national scholarship" is 0.1, and that of "work" is 0.1.
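The word frequency calculation of step S2046 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the application: the function name `term_frequencies` and the order of the words in the list are assumptions, and only the per-word counts are taken from the example of step S2044.

```python
from collections import Counter

# Hypothetical segmentation result. Only the counts come from the example of
# step S2044; the order of the words is an assumption.
words = (
    ["graduation"]
    + ["Qinghua University"] * 2
    + ["computer specialty"] * 4
    + ["achievement excellence", "national scholarship", "work"]
)


def term_frequencies(tokens):
    """Return {word: occurrences / total number of words} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


tf = term_frequencies(words)
for word, value in tf.items():
    print(f"{word}: {value:.1f}")
```

With these counts the sketch gives the same word frequency features as the text, for instance 4 / 10 for "computer specialty".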

Step S2048: For any word, calculate the inverse document frequency feature of the word according to the total number of sample documents in the sample set and the number of sample documents in the sample set that contain the word.

In the embodiment of the application, in order to effectively obtain sample information and avoid information loss, after the counts of step S2044 are obtained, the inverse document frequency feature of each word is calculated according to the total number of sample documents in the sample set and the number of sample documents in the sample set that contain the word. The inverse document frequency feature of a given word is obtained by dividing the total number of sample documents by the number of sample documents containing the word and taking the base-10 logarithm of the quotient. Referring to the example of step S2044, and rounding to one decimal place, the inverse document frequency feature of "graduation" is 0.3, that of "Qinghua University" is 2, that of "computer specialty" is 1, that of "achievement excellence" is 0.3, that of "national scholarship" is 0.7, and that of "work" is 0.1.
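As a rough check of the figures above, the inverse document frequency of step S2048 can be computed as follows. The function name `inverse_document_frequency` is a hypothetical helper, and the document counts are copied from the example of step S2044.

```python
import math

TOTAL_DOCS = 1000  # total number of sample documents in the example

# Number of sample documents containing each word, from the example.
docs_containing = {
    "graduation": 500,
    "Qinghua University": 10,
    "computer specialty": 100,
    "achievement excellence": 500,
    "national scholarship": 200,
    "work": 800,
}


def inverse_document_frequency(total_docs, docs_with_word):
    """Base-10 logarithm of (total documents / documents containing the word)."""
    return math.log10(total_docs / docs_with_word)


idf = {
    word: inverse_document_frequency(TOTAL_DOCS, n)
    for word, n in docs_containing.items()
}
for word, value in idf.items():
    print(f"{word}: {value:.1f}")
```

Words that appear in most sample documents, such as "work", receive a value close to 0, while rare words such as "Qinghua University" receive a larger value.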

Step S2050: For any word, determine the feature of the word according to the word frequency feature and the inverse document frequency feature of the word.

In the embodiment of the present application, in order to effectively obtain sample information and avoid information loss, after the word frequency feature and the inverse document frequency feature of a word are obtained, the two are multiplied to obtain the feature of the word. Referring to the examples of step S2046 and step S2048, the word frequency feature of "Qinghua University" is 0.2 and its inverse document frequency feature is 2, so the feature of "Qinghua University" is 0.4. The features of the words are then combined, and the combined result is taken as the word segmentation features of the first sample document.
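The multiplication in step S2050 can be sketched as below. The text does not say how the per-word features are combined, so the fixed, sorted vocabulary order used to build the vector here is an assumption made purely for illustration.

```python
import math

TOTAL_DOCS = 1000

# Word frequency features and document counts from the examples of
# steps S2044 and S2046.
tf = {
    "graduation": 0.1,
    "Qinghua University": 0.2,
    "computer specialty": 0.4,
    "achievement excellence": 0.1,
    "national scholarship": 0.1,
    "work": 0.1,
}
docs_containing = {
    "graduation": 500,
    "Qinghua University": 10,
    "computer specialty": 100,
    "achievement excellence": 500,
    "national scholarship": 200,
    "work": 800,
}

# Feature of a word = word frequency * inverse document frequency.
features = {
    word: tf[word] * math.log10(TOTAL_DOCS / docs_containing[word])
    for word in tf
}

# Assumption: combine the per-word features into one vector by listing them
# in a fixed (sorted) vocabulary order.
vocabulary = sorted(features)
feature_vector = [features[word] for word in vocabulary]
print(dict(zip(vocabulary, (round(v, 3) for v in feature_vector))))
```

In this example "Qinghua University" and "computer specialty" both end up with a feature of 0.4: the first is rare in the sample set, the second is frequent in the document.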

In a possible implementation manner of the present application, the word frequency feature or the inverse document frequency feature may also be used on its own to obtain the word segmentation features; both are calculated as described in steps S2046 and S2048.

In another possible implementation manner of the present application, the word segmentation features may be obtained by using mutual information, where the mutual information is the relative entropy between the joint distribution p(x, y) and the product of the marginal distributions p(x) and p(y). The mutual information is calculated as:

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

where X and Y are two random variables, p(x, y) is their joint distribution, and p(x) and p(y) are their marginal distributions.
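A minimal sketch of the mutual information calculation is given below. The text does not state which quantities play the roles of X and Y; one common choice, assumed here, is to let X indicate whether a word occurs in a document and Y indicate whether the document is a resume. The natural logarithm and the two example tables are likewise assumptions.

```python
import math


def mutual_information(joint):
    """Compute I(X;Y) from a joint probability table joint[x][y] = p(x, y)."""
    p_x = [sum(row) for row in joint]          # marginal distribution of X
    p_y = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0.0:  # cells with zero probability contribute nothing
                total += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return total


# Hypothetical tables: rows = word absent/present, columns = non-resume/resume.
independent = [[0.25, 0.25], [0.25, 0.25]]  # word says nothing about the class
dependent = [[0.5, 0.0], [0.0, 0.5]]        # word occurs only in resumes

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

A word whose presence is independent of the document type has mutual information 0, while a word that perfectly separates the two types reaches the maximum for a binary label.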

In another possible implementation manner of the present application, the word frequency-inverse document frequency vectorizer (TfidfVectorizer) in the machine learning library scikit-learn may also be used to automatically extract the word frequency feature and the inverse document frequency feature of each word in the first sample document, and the extraction result is then input to the preset model for iterative training.

Based on the embodiment shown in fig. 2, step S206 may be specifically implemented by the flowchart shown in fig. 4, and fig. 4 shows a flowchart of another resume document recognition model training method provided in the embodiment of the present application, where the method specifically includes the following steps:

Step S2062: Extract the word segmentation features of the first sample document, and input the word segmentation features of the first sample document into the preset model to obtain a recognition result of whether the first sample document is a resume document.

Step S2064: Calculate a loss value according to the recognition result of the first sample document and the label carried by the first sample document.

Step S2066: If the loss value is greater than the preset threshold, adjust the model parameters of the preset model.

After step S2066 is executed, the process returns to step S2062.

Step S2068: If the loss value is less than or equal to the preset threshold, stop training and determine that the current preset model is the resume document identification model.

In the embodiment of the application, the word segmentation features of the first sample document are extracted, where the word segmentation features of the first sample document comprise the features of all words in the first sample document. The word segmentation features are input into the preset model to obtain a recognition result of whether the first sample document is a resume document, and a loss value is calculated according to the recognition result and the label carried by the first sample document. If the loss value is greater than the preset threshold, the model parameters of the preset model are adjusted and the process returns to step S2062; iteration stops when the preset training stop condition is reached, giving the trained resume document identification model. If the loss value is less than or equal to the preset threshold, training stops and the current preset model is determined to be the resume document identification model. The preset training stop condition and the preset threshold are selected according to the actual situation, and the embodiment of the application is not limited in this respect.
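The loop of steps S2062 to S2068 can be sketched as follows for a logistic regression preset model. This is an illustration under several assumptions that do not come from the text: the loss is averaged over a tiny hypothetical sample set, parameters are adjusted by plain gradient descent, the learning rate and threshold values are arbitrary, and `max_iters` stands in for the preset training stop condition.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability that a document with these features is a resume."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def cross_entropy(label, prob, eps=1e-12):
    prob = min(max(prob, eps), 1.0 - eps)  # guard against log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def train(samples, labels, threshold=0.1, lr=1.0, max_iters=10000):
    weights, bias = [0.0] * len(samples[0]), 0.0
    loss = float("inf")
    for _ in range(max_iters):  # assumed training stop condition
        probs = [predict(weights, bias, x) for x in samples]          # S2062
        loss = sum(cross_entropy(y, p)
                   for y, p in zip(labels, probs)) / len(samples)     # S2064
        if loss <= threshold:                                         # S2068
            break
        errors = [p - y for p, y in zip(probs, labels)]               # S2066
        for j in range(len(weights)):
            grad = sum(e * x[j] for e, x in zip(errors, samples)) / len(samples)
            weights[j] -= lr * grad
        bias -= lr * sum(errors) / len(samples)
    return weights, bias, loss


# Hypothetical two-dimensional word segmentation features; label 1 = resume.
samples = [[0.4, 0.0], [0.3, 0.1], [0.0, 0.5], [0.1, 0.4]]
labels = [1, 1, 0, 0]
weights, bias, final_loss = train(samples, labels)
print(final_loss)
```

The threshold check comes before the parameter update, so the model returned is the one whose loss was actually measured against the threshold.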

Illustratively, the obtained first sample document is a non-resume sample document carrying the label "0", which marks it as a negative sample. The word segmentation features of the first sample document are extracted and input into the preset model, and the recognition result obtained is "1", meaning that the model has identified the first sample document as a resume sample document. A loss value between the label and the recognition result is calculated by using a loss function, and the model parameters of the preset model are adjusted in return. This is repeated until the preset model can accurately recognize the type of the input sample document; training then stops, and the current preset model is determined to be the resume document recognition model.

Extracting the word segmentation features of the first sample document allows the sample information to be obtained effectively and avoids information loss. Calculating the loss value from the recognition result of the first sample document and the label it carries, and adjusting the model parameters of the preset model whenever the loss value is greater than the preset threshold, makes the model more accurate and thereby improves the efficiency and accuracy of resume document recognition.

Based on the embodiment shown in fig. 2, step S206 may be further specifically implemented by a flowchart shown in fig. 5, where fig. 5 shows a flowchart of another resume document recognition model training method provided in the embodiment of the present application, and the method specifically includes the following steps:

Step S20642: Using a cross entropy loss function, calculate the cross entropy between the recognition result of the first sample document and the label carried by the first sample document, and take it as the loss value.

After step S2066 is executed, the process returns to step S2062.

The descriptions of steps S2062, S2066, and S2068 are specifically described in the foregoing embodiments, and are not repeated in this embodiment.

In this embodiment of the application, after the recognition result of the first sample document is obtained, a cross entropy loss function may be used to calculate the cross entropy between the recognition result of the first sample document and the label carried by the first sample document, and this cross entropy is taken as the loss value. The cross entropy loss function is:

$$H(p, q) = -\sum_{i=1}^{C} p_i \log q_i$$

where C represents the number of classes, p_i is the true value for class i, and q_i is the predicted value for class i.

By utilizing the cross entropy loss function, the cross entropy between the identification result of the first sample document and the label carried by the first sample document is calculated to be used as a loss value, so that the efficiency of calculating the loss value is improved, the model is more accurate, and the efficiency and the accuracy of identifying the resume document are improved.
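For the two-class case (resume and non-resume), the cross entropy loss can be worked through as in the sketch below. The one-hot encoding of the label, the natural logarithm, and the predicted probabilities are assumptions chosen for illustration.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """Cross entropy -sum_i p_i * log(q_i) over C classes.

    p is the true distribution (here a one-hot label), q is the prediction.
    """
    return -sum(p_i * math.log(max(q_i, eps)) for p_i, q_i in zip(p, q))


# Class order: [non-resume, resume]. The document carries label "0".
label_non_resume = [1.0, 0.0]

wrong_prediction = [0.2, 0.8]  # model leans towards "resume"
right_prediction = [0.9, 0.1]  # model leans towards "non-resume"

loss_wrong = cross_entropy(label_non_resume, wrong_prediction)
loss_right = cross_entropy(label_non_resume, right_prediction)
print(loss_wrong, loss_right)
```

The loss is larger when the prediction disagrees with the label, which is what drives the parameter adjustment in step S2066.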

In another possible implementation manner of the present application, other loss functions may be used to calculate the loss value, including KL divergence (relative entropy), an L1 norm loss function, a maximum loss function, a mean square error loss function, a logarithmic loss function, and the like. The loss function is selected according to the actual situation, and the present application does not limit this choice.

Fig. 6 is a flowchart illustrating a resume document identification method provided in an embodiment of the present application, where the method specifically includes the following steps:

Step S302: Acquire a target document to be identified.

In the embodiment of the present application, the target document to be identified may take many forms, including but not limited to txt, doc, docx, and the like. The form is selected according to the actual situation, and the embodiment of the present application is not limited in this respect.

Step S304: Input the target document into the resume document identification model to obtain an identification result of whether the target document is a resume document, wherein the resume document identification model is obtained by training with the resume document identification model training method shown in fig. 2 to 5.

Fig. 7 is a flowchart of another resume document identification method provided in an embodiment of the present application, which specifically includes:

Data preparation: the server side collects some resume sample documents as positive samples, and some documents that are not resume documents, or that are merely related to resume documents, as negative samples. All the collected sample documents together form the sample set.

Model training: the server side extracts a first sample document from the sample set and calls a final word segmentation tool to segment it, obtaining each word in the first sample document. The server side counts the number of occurrences of each word in the first sample document, the total number of words in the first sample document, the total number of sample documents in the sample set, and, for any word, the number of sample documents in the sample set that contain the word. The word frequency feature of each word is calculated according to the number of occurrences of the word in the first sample document and the total number of words, and the inverse document frequency feature of each word is calculated according to the total number of sample documents in the sample set and the number of sample documents in the sample set that contain the word. For any word, the feature of the word is determined according to its word frequency feature and inverse document frequency feature. The features of the words in the first sample document are input into a logistic regression model to obtain a recognition result of whether the first sample document is a resume document, and a cross entropy loss function is used to calculate the cross entropy between the recognition result and the label carried by the first sample document as the loss value. If the loss value is less than or equal to the preset threshold, training stops and the current logistic regression model is confirmed as the resume document identification model; if the loss value is greater than the preset threshold, the parameters of the logistic regression model are adjusted, and iterative optimization is repeated to obtain the resume document identification model.

Resume identification: the server side acquires a target document to be identified and inputs the target document into the resume document identification model to obtain an identification result of whether the target document to be identified is a resume document.
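The identification stage can be sketched as below, assuming the target document has already been segmented and a logistic regression model has already been trained. The vocabulary, inverse document frequency table, weights, bias, and the 0.5 decision cutoff are all hypothetical values introduced for illustration and are not given in the text.

```python
import math
from collections import Counter


def tfidf_vector(tokens, vocabulary, idf):
    """Word frequency * inverse document frequency, in vocabulary order."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for an empty document
    return [counts[word] / total * idf.get(word, 0.0) for word in vocabulary]


def is_resume(tokens, vocabulary, idf, weights, bias, cutoff=0.5):
    """Return True if the model's probability reaches the assumed cutoff."""
    x = tfidf_vector(tokens, vocabulary, idf)
    z = sum(w * v for w, v in zip(weights, x)) + bias
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= cutoff


# Hypothetical trained state (not taken from the application).
vocabulary = ["Qinghua University", "computer specialty", "weather"]
idf = {"Qinghua University": 2.0, "computer specialty": 1.0, "weather": 1.0}
weights = [5.0, 5.0, -5.0]
bias = 0.0

resume_like = ["Qinghua University", "computer specialty"]
other = ["weather", "weather"]
print(is_resume(resume_like, vocabulary, idf, weights, bias))
print(is_resume(other, vocabulary, idf, weights, bias))
```

The same feature extraction is applied at identification time as during training, so the inverse document frequency values computed on the sample set are kept alongside the model parameters.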

Corresponding to the above method embodiment, the present application further provides an embodiment of a resume document identification model training device, and fig. 8 shows a schematic structural diagram of a resume document identification model training device provided in an embodiment of the present application. As shown in fig. 8, the apparatus includes:

a first obtaining module 402 configured to obtain a sample set, wherein the sample set includes a plurality of sample documents, and the sample documents include resume sample documents;

a word segmentation module 404 configured to extract a first sample document in the sample set, perform word segmentation on the first sample document, and obtain word segmentation features of the first sample document based on a word segmentation result, where the first sample document is any sample document in the sample set;

the training module 406 is configured to train the preset model based on the word segmentation features of the sample documents in the sample set, so as to obtain a resume document identification model.

The training module 406 is further configured to extract the word segmentation features of the first sample document, input the word segmentation features of the first sample document into the preset model, and obtain a recognition result of whether the first sample document is a resume document; calculate a loss value according to the recognition result of the first sample document and the label carried by the first sample document; if the loss value is greater than the preset threshold, adjust the model parameters of the preset model and return to the step of extracting the word segmentation features of the first sample document and inputting them into the preset model to obtain the recognition result of whether the first sample document is a resume document; and if the loss value is less than or equal to the preset threshold, stop training and determine that the current preset model is the resume document identification model.

Optionally, the training module 406 is further configured to calculate, as the loss value, a cross entropy between the recognition result of the first sample document and the tag carried by the first sample document by using a cross entropy loss function according to the recognition result of the first sample document and the tag carried by the first sample document.

Optionally, the preset model is a logistic regression model.

Optionally, the word segmentation module 404 is further configured to invoke a preset word segmentation component, and segment the first sample document by using the preset word segmentation component to obtain words in the first sample document.

The word segmentation module 404 is further configured to count the number of occurrences of each word of the word segmentation result in the first sample document, the total number of words in the first sample document, the total number of sample documents in the sample set, and, for any word, the number of sample documents in the sample set that contain the word; calculate the word frequency feature of each word according to the number of occurrences of each word in the first sample document and the total number of words; for any word, calculate the inverse document frequency feature of the word according to the total number of sample documents in the sample set and the number of sample documents in the sample set that contain the word; and, for any word, determine the feature of the word according to the word frequency feature and the inverse document frequency feature of the word.

By applying the scheme of the embodiment of the specification, a sample set is obtained, which comprises a plurality of resume sample documents and may also comprise a plurality of non-resume sample documents. Any sample document is extracted and segmented, and its word segmentation features are obtained based on the segmentation result. The preset model is then trained based on the word segmentation features of each sample document to obtain the resume document identification model. Segmenting the sample documents and extracting their word segmentation features allows sample information to be obtained effectively and avoids information loss, thereby improving the efficiency and accuracy of resume document identification.

The above is a schematic scheme of a resume document recognition model training device of this embodiment. It should be noted that the technical solution of the resume document identification model training device and the technical solution of the resume document identification model training method belong to the same concept, and details of the technical solution of the resume document identification model training device, which are not described in detail, can be referred to in the description of the technical solution of the resume document identification model training method. Further, the components in the device embodiment should be understood as functional blocks that must be created to implement the steps of the program flow or the steps of the method, and each functional block is not actually divided or separately defined. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.

Corresponding to the above method embodiment, the present application further provides an embodiment of a resume document identification apparatus, and fig. 9 shows a schematic structural diagram of a resume document identification apparatus provided in an embodiment of the present application. As shown in fig. 9, the apparatus includes:

a second obtaining module 502 configured to obtain a target document to be identified;

and the target identification module 504 is configured to input the target document into the resume document identification model, and obtain an identification result of whether the target document is the resume document, wherein the resume document identification model is obtained by training through the resume document identification model training method.

By applying the scheme of the embodiment of the specification, a target document to be recognized is obtained and input into the resume document recognition model to obtain a recognition result of whether the target document is a resume document. The resume document recognition model is trained as follows: a sample set is obtained, which comprises a plurality of resume sample documents and may also comprise a plurality of non-resume sample documents; any sample document is extracted and segmented, and its word segmentation features are obtained based on the segmentation result; the preset model is then trained based on the word segmentation features of each sample document to obtain the resume document recognition model. Segmenting the sample documents and extracting their word segmentation features allows sample information to be obtained effectively and avoids information loss, thereby improving the efficiency and accuracy of resume document recognition.

The above is an illustrative scheme of the resume document identification apparatus of the present embodiment. It should be noted that the technical solution of the resume document identification apparatus and the technical solution of the resume document identification method belong to the same concept, and details of the technical solution of the resume document identification apparatus, which are not described in detail, can be referred to the description of the technical solution of the resume document identification method. Further, the components in the device embodiment should be understood as functional blocks that must be created to implement the steps of the program flow or the steps of the method, and each functional block is not actually divided or separately defined. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.

Fig. 10 illustrates a block diagram of a computing device 700 provided according to an embodiment of the present application. The components of the computing device 700 include, but are not limited to, memory 710 and a processor 720. Processor 720 is coupled to memory 710 via bus 730, and database 750 is used to store data.

Computing device 700 also includes access device 740, access device 740 enabling computing device 700 to communicate via one or more networks 760. Examples of such networks include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 740 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present application, the above-described components of computing device 700, as well as other components not shown in FIG. 10, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 10 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.

Computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 700 may also be a mobile or stationary server.

Wherein processor 720 is configured to execute the following computer-executable instructions: obtaining a sample set, wherein the sample set comprises a plurality of sample documents, and the sample documents comprise resume sample documents; extracting a first sample document in the sample set, segmenting words of the first sample document, and obtaining segmentation characteristics of the first sample document based on a segmentation result, wherein the first sample document is any sample document in the sample set; and training a preset model based on the word segmentation characteristics of each sample document in the sample set to obtain a resume document identification model.

The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the resume document identification model training method described above belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the resume document identification model training method described above.

Processor 720 is also configured to execute the following computer-executable instructions: acquiring a target document to be identified; and inputting the target document into the resume document identification model to obtain an identification result of whether the target document is the resume document, wherein the resume document identification model is obtained by utilizing the resume document identification model training method.

The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the resume document identification method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the resume document identification method.

An embodiment of the present application further provides a computer-readable storage medium storing computer instructions that, when executed by a processor, are configured to: obtaining a sample set, wherein the sample set comprises a plurality of sample documents, and the sample documents comprise resume sample documents; extracting a first sample document in the sample set, segmenting words of the first sample document, and obtaining segmentation characteristics of the first sample document based on a segmentation result, wherein the first sample document is any sample document in the sample set; and training a preset model based on the word segmentation characteristics of each sample document in the sample set to obtain a resume document identification model.

The above is an illustrative scheme of the computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the resume document identification model training method described above belong to the same concept; for details of the storage medium that are not described here, reference may be made to the description of the resume document identification model training method.

An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the following steps: acquiring a target document to be identified; and inputting the target document into the resume document identification model to obtain an identification result of whether the target document is a resume document, wherein the resume document identification model is obtained by the resume document identification model training method described above.

The above is an illustrative scheme of the computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the resume document identification method described above belong to the same concept; for details of the storage medium that are not described here, reference may be made to the description of the resume document identification method.
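As a non-limiting illustration, the identification steps above (acquiring a target document, inputting it into the trained model, obtaining a result) might be sketched as follows. The whitespace-based `segment` function, the term-frequency featurization, the `vocabulary`, and the weight values are all hypothetical placeholders chosen for the example and are not taken from the application; a Chinese-language document would need a real word segmenter.

```python
import math
from collections import Counter

def segment(text):
    # Placeholder segmenter (assumption): lowercase and split on whitespace.
    return text.lower().split()

def featurize(words, vocabulary):
    # Term frequency of each vocabulary word in the target document.
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for an empty document
    return [counts[term] / total for term in vocabulary]

def identify(text, vocabulary, weights, bias, threshold=0.5):
    # Logistic-regression style scoring: sigmoid(w . x + b).
    features = featurize(segment(text), vocabulary)
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= threshold, probability

# Hypothetical trained parameters, for illustration only.
VOCABULARY = ["education", "experience", "invoice"]
WEIGHTS = [4.0, 4.0, -6.0]
BIAS = -1.0

resume_result, _ = identify("Education and work experience", VOCABULARY, WEIGHTS, BIAS)
other_result, _ = identify("invoice invoice total", VOCABULARY, WEIGHTS, BIAS)
```

In this sketch the identification result is a boolean together with the underlying probability; the 0.5 decision threshold is likewise an assumption.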

The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier wave signal, a telecommunications signal, a software distribution medium, and the like.

It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of combined actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules referred to are not necessarily required by the present application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. The alternative embodiments do not set out every detail exhaustively, nor do they limit the application to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, and thereby to enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims

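1. A resume document recognition model training method, characterized by comprising the following steps:

obtaining a sample set, wherein the sample set comprises a plurality of sample documents, and the sample documents comprise resume sample documents;

extracting a first sample document in the sample set, segmenting the first sample document into words, and obtaining word segmentation characteristics of the first sample document based on a word segmentation result, wherein the first sample document is any sample document in the sample set;

and training a preset model based on the word segmentation characteristics of each sample document in the sample set to obtain a resume document identification model.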

2. The method of claim 1, wherein the sample documents further comprise a non-resume sample document, wherein the resume sample document carries a positive sample tag marking it as a resume document, and the non-resume sample document carries a negative sample tag marking it as a non-resume document;

the step of training a preset model based on the word segmentation characteristics of each sample document in the sample set to obtain a resume document identification model comprises the following steps:

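extracting word segmentation characteristics of the first sample document, inputting the word segmentation characteristics of the first sample document into a preset model, and obtaining an identification result of whether the first sample document is a resume document;

calculating a loss value based on the identification result of the first sample document and the tag carried by the first sample document;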

if the loss value is larger than a preset threshold value, adjusting model parameters of the preset model, returning to execute the step of extracting the word segmentation characteristics of the first sample document, inputting the word segmentation characteristics of the first sample document into the preset model, and obtaining a recognition result of whether the first sample document is a resume document;

and if the loss value is less than or equal to the preset threshold value, stopping training and determining that the current preset model is a resume document identification model.
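As a non-limiting illustration, the train-until-threshold loop recited in claim 2 might be sketched as follows. The sketch assumes a logistic regression model trained by batch gradient descent over all samples with a mean cross entropy loss; the learning rate, the loss threshold, the `max_rounds` safety cap, and the batch formulation are assumptions made for the example, and the feature vectors are taken to be already extracted.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, bias, features):
    # Probability that the document is a resume document.
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def mean_cross_entropy(weights, bias, samples, eps=1e-12):
    total = 0.0
    for features, label in samples:
        p = min(max(predict(weights, bias, features), eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(samples)

def train(samples, loss_threshold=0.1, learning_rate=0.5, max_rounds=10000):
    # samples: list of (feature_vector, label), label 1 = resume, 0 = non-resume.
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(max_rounds):
        # Stop once the loss is at or below the preset threshold.
        if mean_cross_entropy(weights, bias, samples) <= loss_threshold:
            break
        # Otherwise adjust the model parameters and evaluate again.
        grad_w, grad_b = [0.0] * n, 0.0
        for features, label in samples:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias, mean_cross_entropy(weights, bias, samples)
```

A call such as `train([([1.0, 0.0], 1), ([0.0, 1.0], 0)])` returns the learned weights, the bias, and the final loss value.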

3. The method according to claim 2, wherein the step of calculating a loss value based on the identification result of the first sample document and the tag carried by the first sample document comprises:

and using a cross entropy loss function to calculate, as the loss value, the cross entropy between the identification result of the first sample document and the tag carried by the first sample document.
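As a non-limiting illustration, for a two-class identification result the cross entropy loss conventionally takes the form below. The symbols are assumptions introduced here: y is the tag carried by the sample document (1 for the positive sample tag, 0 for the negative sample tag) and ŷ is the probability output by the preset model that the document is a resume document.

```latex
L(y, \hat{y}) = -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr],
\qquad y \in \{0, 1\},\quad 0 < \hat{y} < 1
```

For example, with y = 1 and ŷ = 0.9 the loss is -ln 0.9 ≈ 0.105, and it grows as ŷ moves away from the tag.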

4. The method according to claim 1 or 2, wherein the preset model is a logistic regression model.

5. The method according to claim 1 or 2, wherein the step of segmenting the first sample document into words comprises:

6. The method of claim 1 or 2, wherein the word segmentation features comprise features of words in the first sample document;

the step of obtaining the word segmentation characteristics of the first sample document based on the word segmentation result comprises:

calculating the word frequency feature of each word according to the number of occurrences of the word in the first sample document and the total number of words in the first sample document;

for any word, calculating the inverse document frequency feature of the word according to the total number of sample documents in the sample set and the number of sample documents in the sample set that contain the word;
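As a non-limiting illustration, the word frequency and inverse document frequency calculations recited above might be sketched as follows. The unsmoothed form log(N / df) and the final multiplication of the two values into a TF-IDF weight follow common practice and are assumptions made here; each document is represented as an already segmented list of words.

```python
import math
from collections import Counter

def term_frequencies(words):
    # Occurrences of each word divided by the total number of words.
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}

def inverse_document_frequencies(documents):
    # log(total documents / documents containing the word); unsmoothed (assumption).
    n = len(documents)
    containing = Counter()
    for words in documents:
        containing.update(set(words))
    return {word: math.log(n / df) for word, df in containing.items()}

def tf_idf(words, idf):
    # Conventional combination (assumption): product of TF and IDF per word.
    return {word: tf * idf.get(word, 0.0)
            for word, tf in term_frequencies(words).items()}

sample_set = [["education", "work", "skills"], ["work", "meeting"]]
idf = inverse_document_frequencies(sample_set)
features = tf_idf(sample_set[0], idf)
```

A word that appears in every sample document, such as "work" here, receives an inverse document frequency of zero and therefore contributes nothing to the resulting feature.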

7. A resume document identification method is characterized by comprising the following steps:

acquiring a target document to be identified;

inputting the target document into a resume document recognition model obtained by training according to the method of any one of claims 1 to 6, and obtaining a recognition result of whether the target document is a resume document.

8. A resume document recognition model training device, comprising:

a first obtaining module configured to obtain a sample set, the sample set comprising a plurality of sample documents, the sample documents comprising resume sample documents;

the word segmentation module is configured to extract a first sample document in the sample set, segment the first sample document, and obtain word segmentation characteristics of the first sample document based on a word segmentation result, wherein the first sample document is any sample document in the sample set;

and the training module is configured to train a preset model based on the word segmentation characteristics of each sample document in the sample set to obtain a resume document identification model.

9. A resume document identification apparatus, comprising:

a target recognition module configured to input the target document into the resume document recognition model obtained by training according to the method of any one of claims 1 to 6, and obtain a recognition result of whether the target document is a resume document.

10. A computing device, comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the method of any one of claims 1 to 6 or claim 7.

11. A computer-readable storage medium storing computer instructions, which when executed by a processor implement the steps of the method of any one of claims 1 to 6 or claim 7.