CN113886604A

CN113886604A - Job knowledge map generation method and system

Info

Publication number: CN113886604A
Application number: CN202111220412.2A
Authority: CN
Inventors: 戴圣骐; 林自达; 俞希林
Original assignee: Qianjin Network Information Technology (shanghai) Co ltd
Current assignee: Qianjin Network Information Technology (shanghai) Co ltd
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2022-01-04

Abstract

The invention relates to a method and a system for generating a job position knowledge graph. The method comprises the following steps: establishing a corpus with different data states based on a job description data set, a personal resume data set and an encyclopedia knowledge data set; recalling entities from the corpus through an entity recall model to obtain graph entities, wherein the entities are nouns or noun phrases; extracting relationships between entities from the encyclopedia knowledge data set through a relationship extraction model, wherein the relationship between entities is either containment or similarity; and establishing a mapping relationship between the entities according to the relationships between them. With the graph of the invention, the resumes of job-seeking users and the job descriptions of recruiting users serve as basic information; keywords are extracted from this information through semantic analysis and matched with entities in the graph, so that user needs can be understood in depth and implicit information in the stated requirements can be obtained.

Description

Job knowledge map generation method and system

Technical Field

The invention relates to knowledge graphs, and in particular to a method and a system for generating a job position knowledge graph applied to a recruitment platform.

Background

A recruitment platform is an information platform widely used in the modern information society. On the one hand, a job seeker can look for jobs that match his or her expectations through the search engine of the recruitment platform, which typically offers single-choice or multiple-choice options for setting the search criteria. These options usually cover what job seekers care about most, such as "industry", "job function", "salary range", "company type" and "work location". In practice, the search conditions defined by these fixed options are too broad and correspond to a large amount of information: a first search may return a large number of positions, which the job seeker must then screen manually or narrow down with a second search. Moreover, the limited set of existing options cannot fully reflect the real intent of the job seeker, so the search results may not serve the purpose of the search. To let a job seeker enter specific search terms, the search options typically include a keyword field. Because job seekers express themselves differently, the keywords they enter for the same meaning may vary widely, so the search engine cannot interpret them correctly and the search deviates. On the other hand, most recruitment platforms provide a position recommendation function, which matches job seekers with recruiters based on the job seeker's resume and the recruiter's recruitment information to find positions that meet the job seeker's needs.
However, job seekers describe key information such as positions and skills in many different ways in resumes and related job-seeking requirements, and recruiters likewise describe such information in many different ways in their recruitment information. Different vocabulary and different expressions may be used for the same meaning, which increases the difficulty of searching for and matching positions.

Disclosure of Invention

In view of the technical problems in the prior art, the invention provides a job position knowledge graph generation method and system for providing multiple pieces of content at different levels of detail along the same job-related dimension.

In order to solve the above technical problem, according to one aspect of the present invention, there is provided a job knowledge graph generation method, including the steps of: establishing a corpus based on a job description data set, a personal resume data set and an encyclopedia knowledge data set; recalling entities from the corpus through an entity recall model to obtain graph entities, wherein the entities are nouns or noun phrases; extracting relationships between entities from the encyclopedia knowledge data set through a relationship extraction model, wherein the relationship between entities is either containment or similarity; and establishing a mapping relationship between the entities according to the relationships between them.

According to another aspect of the invention, there is further provided a system for generating the job knowledge graph, which comprises a corpus module, an entity recall module, a relationship extraction module and a graph generation module. The corpus module is configured to establish a corpus based on a job description data set, a personal resume data set and an encyclopedia knowledge data set. The entity recall module is configured to recall entities from the corpus through an entity recall model to obtain graph entities, wherein the entities are nouns or noun phrases. The relationship extraction module is connected with the entity recall module and is configured to extract relationships between entities from the encyclopedia knowledge data set through a relationship extraction model, wherein the relationship between entities is either containment or similarity. The graph generation module is connected with the entity recall module and the relationship extraction module, and is configured to take the entities as nodes and establish connections among the nodes according to the relationships among the entities, so as to generate the job knowledge graph.

The invention uses data uploaded by users to the platform, such as job description data and personal resumes, together with public encyclopedia data, to obtain job-related nouns or noun phrases as entities. Each entity has one or more attributes. Based on the business needs of job search, recommendation and the like, the relationships between entities are defined as containment and similarity, and the containment or similarity relationships of entities with respect to certain attributes can be obtained automatically from the encyclopedia data. When one entity is known, multiple entities sharing an attribute can be found through that attribute; the containment relationship expresses the semantics of abstracting from the detailed to the general; and the graph is matched with the job classification table. Therefore, with the graph of the invention, the resume of a job-seeking user and the job description of a recruiting user serve as basic information, keywords are extracted from this information through semantic analysis and matched with entities in the graph, so that user needs are understood in depth and implicit information in the stated requirements is obtained.

Drawings

Preferred embodiments of the present invention will now be described in further detail with reference to the accompanying drawings, in which:

FIG. 1 is a flow diagram of a method of the job knowledge-graph generation according to one embodiment of the present invention;

FIG. 2 is a flow diagram of a method of entity recall according to one embodiment of the present invention;

FIG. 3 is a flow diagram of a method of obtaining a data set for entity recall according to one embodiment of the present invention;

FIG. 4 is a diagram of data presented for labeling in an Excel grid with one character per cell, in accordance with one embodiment of the present invention;

FIG. 5 is a flow diagram of a method for entity relationship extraction according to one embodiment of the invention;

FIG. 6 is a functional block diagram of the job knowledge graph generation system according to one embodiment of the present invention;

FIG. 7 is a functional block diagram of a corpus module according to an embodiment of the present invention;

FIG. 8 is a functional block diagram of an entity recall module according to one embodiment of the present invention;

FIG. 9 is a functional block diagram of a relationship extraction module according to one embodiment of the present invention;

FIG. 10 is a diagram of a portion of entities in a knowledge-graph and their relationship displays, according to one embodiment of the present invention;

FIG. 11 is a flow chart of a job recommendation method according to an embodiment of the present invention; and

FIG. 12 is a flowchart of generating a first tag of a job-seeking user according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the following detailed description, reference is made to the accompanying drawings that form a part hereof and in which is shown by way of illustration specific embodiments of the application. In the drawings, like numerals describe substantially similar components throughout the different views. Various specific embodiments of the present application are described in sufficient detail below to enable those skilled in the art to practice the teachings of the present application. It is to be understood that other embodiments may be utilized and structural, logical or electrical changes may be made to the embodiments of the present application.

A knowledge graph is a semantic network that reveals the relationships between entities. Each piece of knowledge is represented as an SPO (Subject-Predicate-Object) triple, which is closer to human cognitive thinking and provides an effective way to express, organize, manage and use massive, heterogeneous and dynamic data on the Internet. The knowledge graph system for a recruitment platform provided by the invention recalls entities from documents inside the recruitment platform, such as recruitment information, resumes and job listings, and from resumes and the like in some public databases, and extracts the relationships among the entities using a general encyclopedia knowledge base consisting of encyclopedia entries captured from the Internet, thereby establishing a job knowledge graph for the job-seeking and recruitment field. The generation process of the job knowledge graph is shown in FIG. 1 and comprises the following steps:

Step S1, creating a corpus.

Step S2, recalling entities.

Step S3, extracting entity relationships.

Step S4, maintaining the graph.

In step S1, the data sources are first determined. The method takes, as data sources of the corpus, data inside the recruitment platform in documents such as recruitment information, resumes and job listings, resumes in public databases, and encyclopedia data (including corpora captured from Wikipedia webpages, triplets captured from encyclopedia non-hidden webpages, triplets from the openly published 2015 version of 7Lore, and full corpora captured based on 30k splicing words). These sources mainly comprise a job description data set published by recruiters, a personal resume data set and a WIKI data set. The data in the three data sets are processed to obtain a corpus that has a standard format and is easy to use. The processing specifically comprises the following steps:

First, data screening and cleaning are performed on the three determined data sets. This mainly comprises: deleting unnecessary information that would affect subsequent analysis of the corpus, such as long foreign-language passages, telephone numbers, email addresses or postal addresses; normalizing the data format to minimize heterogeneity among different corpora caused by external factors such as format, symbols and encoding; and collecting statistics on the field contents of each entry, making a preliminary judgment on the validity of the fields, and deleting entries that are too long, too short, overflowing or obviously unreasonable.

The corpus data are then merged and split. In the invention, the data in the three data sets are split and, according to the different requirements placed on the corpus by later steps and the final splitting result, divided into data sets in five states:

1) a raw data set: the original data set before cleaning;

2) a clean data set: the data set after data cleaning;

3) a paragraph data set: a data set deduplicated at the entry and paragraph scales;

4) a sentence data set: a prediction data set in units of sentences, formed by further splitting the paragraph data set into sentences, deduplicating and sorting;

5) a title data set, generated for the WIKI data set only: a word list generated by selecting representative WIKI entries that meet specified criteria, mostly used for screening and statistics in subsequent steps.

In this step, using the above cleaning result as a framework, the job description data set, the personal resume data set and the WIKI data set are each split at the entry and sentence levels. On the basis of the clean files obtained after screening and cleaning, deduplication at the entry and paragraph scales is performed, which effectively mitigates the impact on corpus diversity of objective factors such as repeated posting of the same position, repeated submission of resumes and mutual copying among WIKI paragraphs. To achieve a more reliable deduplication effect, methods such as sorting followed by merging based on a computed Levenshtein similarity ratio can be adopted.
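The sort-then-merge deduplication mentioned above can be sketched as follows. This is only a minimal illustration: the similarity ratio (one minus the edit distance divided by the longer length), the 0.9 threshold and the function names are assumptions made for the example, not values given in the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical (assumed definition)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def deduplicate(paragraphs, threshold=0.9):
    """Sort the paragraphs, then drop each one that is near-identical
    to the most recently kept paragraph (hypothetical threshold)."""
    kept = []
    for text in sorted(paragraphs):
        if kept and similarity(kept[-1], text) >= threshold:
            continue
        kept.append(text)
    return kept
```

Sorting brings near-duplicates next to each other, so each paragraph only needs to be compared with its predecessor rather than with every other paragraph.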

In one embodiment, only one entry is stored per line in the corpus, and the format of each entry is defined as follows:

<primary key>|<description>

Here the "|" symbol is the field separator (with no space on either side), and <primary key> and <description> denote the contents of different fields. A legal entry need not contain any description field, but every entry must contain the primary key as an index. Any field contains only Chinese characters, lowercase English letters, digits and the designated English symbols [, ; | . ? / + - # @]; in particular, a space is used to replace any illegal symbol. No field contains consecutive symbols. The first field must be the primary key, a unique job or resume identifier, and the last field must be the job or resume description (i.e., the core field for semantic analysis). Note that a resume entry has several description fields, including work experience, project experience and self-introduction. Description fields carry no list numbering of any form (1, two, iii, etc.); sentences are separated by the five symbols [, ; . ! ?], and parallel items within a sentence are separated by the [/] symbol (replacing an enumeration comma, "and" or "or" that indicates a parallel relationship). The data sets in the corpus are stored as files; as a specific example, corpus files are stored in UTF-8 encoding without a signature (BOM), and the last entry in a file is followed by a line break (\n) with no further content, to facilitate concatenation.
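As an illustration of the one-entry-per-line layout described above, the following sketch parses and serializes an entry. The `Entry` type and the function names are hypothetical, and the sketch assumes only what the text states: fields are separated by "|", the first field is the primary key, and description fields are optional.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    primary_key: str         # unique job or resume identifier
    descriptions: List[str]  # zero or more description fields


def parse_entry(line: str) -> Entry:
    """Split one corpus line on the '|' field separator."""
    fields = line.rstrip("\n").split("|")
    if not fields[0]:
        raise ValueError("every entry must start with a primary key")
    # Empty trailing fields are ignored, so a dangling separator is tolerated.
    return Entry(fields[0], [f for f in fields[1:] if f])


def format_entry(entry: Entry) -> str:
    """Serialize an entry; the trailing newline makes files easy to concatenate."""
    return "|".join([entry.primary_key, *entry.descriptions]) + "\n"
```

For example, `parse_entry("job001|familiar with java/python\n")` yields a primary key of `job001` and a single description field.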

Among the tasks of knowledge graph generation in this system, basic entity recall is one of the most critical steps. The number and quality of the entities in the graph directly determine the quality of the final knowledge graph and of the subsequent services that use it. In the invention, an entity is defined as a noun or a compound noun phrase, which can reflect the various kinds of information on the recruitment platform and is also convenient for searching and matching when subsequent services use the graph. The entity recall steps are shown in FIG. 2 and include:

In step S21, sentences are preliminarily screened according to characteristic sentence patterns, composed of specific words or patterns, and various dictionaries, to obtain a data set for entity recall. The information of interest differs between application environments. For a position recommendation system, for example, useful information generally conforms to specific sentence patterns, such as a pattern in which a verb such as "be familiar with", "master", "understand" or "be responsible for" is followed by a noun. Sentences matching such characteristic patterns are screened from the corpus to form a first primary selection data set. In addition, some embodiments also use an entity dictionary accumulated through the business. The dictionary records words commonly used in position search, such as position names, skill names and level names, which can be used directly as entities in the graph; sentences containing entities from this dictionary are extracted from the corpus to form a second primary selection data set. In one embodiment of the invention, the tags in the tag dictionary used by the recommendation system are also split into morphemes, new words are obtained from these morphemes, and sentences containing the new words are extracted from the corpus to form a third primary selection data set. In one embodiment, an AC (Aho-Corasick) automaton is used to map the entities in the entity dictionary and the split tag morphemes onto the corpus to obtain the sentences containing them, and the three primary selection data sets are combined to form a primary screening data set. The specific process is shown in FIG. 3 and includes:

In step S211, sentences matching the characteristic sentence patterns are screened from the corpus to form a first primary selection data set.

Step S212, for the existing entity dictionary, an AC automaton is used to map the entities in the dictionary onto the corpus, and the sentences containing dictionary entities are obtained to form a second primary selection data set.
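To make the dictionary-to-corpus mapping concrete, here is a compact Aho-Corasick automaton in plain Python. It is a sketch under stated assumptions rather than the implementation used by the invention: the class name, the `select_sentences` helper and the English sample data are all hypothetical.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern matcher: finds every dictionary word in one pass over the text."""

    def __init__(self, words):
        self.goto = [{}]   # per-node child map: character -> node index
        self.fail = [0]    # per-node failure link
        self.out = [[]]    # per-node list of words that end at this node
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(word)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 nodes fall back to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the fallback node (proper suffixes).
                self.out[child] += self.out[self.fail[child]]
                queue.append(child)

    def find(self, text):
        """Return (start_index, word) for every occurrence of a dictionary word."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                hits.append((i - len(word) + 1, word))
        return hits


def select_sentences(sentences, dictionary):
    """Keep the sentences that contain at least one dictionary entity."""
    automaton = AhoCorasick(dictionary)
    return [s for s in sentences if automaton.find(s)]
```

Because the automaton is built once and each sentence is scanned once, the cost grows with the corpus size rather than with the corpus size multiplied by the dictionary size, which is the usual reason for choosing this structure over repeated substring searches.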

Step S213, splitting the tags in the tag dictionary to obtain a tag morpheme set. This step makes use of data that is already available, for example the data in the tag dictionary. Morphemes can be combined bottom-up into new words or phrases and can thus become an important source of knowledge graph entities. The tag dictionary contains tags accumulated in the recommendation system; each tag is composed of a prefix morpheme and a suffix morpheme, which are generally two-character or three-character words. To obtain more knowledge graph entities, this embodiment also extends the tag dictionary.

Step S214, the AC automaton is adopted to map the content in the tag morpheme set to the corpus, and sentences including the tag morphemes are extracted from the corpus.

In step S215, new words are extracted from the obtained sentences containing tag morphemes. Each sentence is segmented and its grammatical structure analyzed one by one to determine the morphemes that can combine with a tag morpheme, yielding new words that include the original tag morpheme. When a new word is obtained from a tag morpheme, its word frequency is queried, and new words with a frequency below 10 are filtered out. To improve the quality of the new words, the invention also limits their length: new words shorter than the morpheme length (two characters) or too long (more than ten characters) are deleted.
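The frequency and length filter of this step might look like the sketch below. The minimum frequency of 10 and the two-character lower bound come from the text; the upper bound of 10 characters is an assumption borrowed from the 2 to 10 range stated later for candidate entities, and counting frequency by substring occurrence is likewise an assumption.

```python
from collections import Counter


def filter_new_words(candidates, corpus_sentences,
                     min_freq=10, min_len=2, max_len=10):
    """Keep new words that are frequent enough and of acceptable length.

    max_len and the substring-based counting are illustrative assumptions.
    """
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep first-seen order
    freq = Counter()
    for sentence in corpus_sentences:
        for word in unique:
            freq[word] += sentence.count(word)
    return [w for w in unique
            if min_len <= len(w) <= max_len and freq[w] >= min_freq]
```

With a toy corpus in which "sql tuning" occurs ten times and "sql audit" three times, only "sql tuning" would survive the filter.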

In step S216, sentences containing suitable new words are extracted from the corpus to form a third primary selection data set.

Step S217, merging the first primary selection data set, the second primary selection data set and the third primary selection data set to obtain a primary screening data set.

Step S218, labeling the primary screening data set. In one embodiment, an Excel grid with one character per cell is used to display the data to be labeled, i.e., one sentence per row and one character per column. Different kinds of labels are marked with different background colors, and the operation can be completed quickly with the format painter, which greatly facilitates the annotators and increases labeling speed, as shown in FIG. 4. After annotation is complete, the original annotation data from multiple sources are merged in Excel and then converted with Python into the biee format that the model can use.
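The conversion from marked character spans to a per-character tag sequence can be illustrated as follows. The text does not spell out the tag scheme, so this sketch assumes a BIOES-style scheme (Begin, Inside, End, Single, Outside); the function name and the `ENT` label are hypothetical.

```python
def spans_to_bioes(sentence, spans, label="ENT"):
    """Turn (start, end) character spans, end exclusive, into per-character tags.

    Assumes a BIOES-style scheme, which is an assumption of this example.
    """
    tags = ["O"] * len(sentence)
    for start, end in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"          # single-character entity
        else:
            tags[start] = f"B-{label}"          # first character
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"          # inner characters
            tags[end - 1] = f"E-{label}"        # last character
    return list(zip(sentence, tags))
```

Each row of the one-character-per-cell sheet maps naturally onto this representation: a colored run of cells becomes one span.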

Step S219, corpus alignment. This mainly consists of adjusting and correcting the fully labeled primary screening data set, with the aim of extending incompletely recalled phrases, trimming over-recalled phrases and adding omitted phrases, thereby obtaining the corpus data set for entity recall.

Step S22, based on the annotated corpus data set and with sentences as processing units, entities are recalled using a Named Entity Recognition (NER) model as the entity recall model to obtain candidate entities. Generally, the task of named entity recognition is to identify named entities of three major classes (entity, time and numeric) and seven minor classes (person name, organization name, place name, time, date, currency and percentage) in the text to be processed. The recognition process generally involves two parts: 1) identifying entity boundaries; 2) determining the entity class. In the invention, entity extraction for the knowledge graph places more emphasis on identifying entity boundaries.

For the various algorithms, the position of a word in the sentence and the categories or other features of the surrounding words are important criteria for judging whether the word is a valid entity phrase. In particular, compared with the Hidden Markov Model (HMM), which performs well in Chinese word segmentation, the conditional random field (CRF) algorithm is more commonly used in named entity recognition and can effectively complete sequence labeling tasks. The objective function of the CRF considers not only the state feature functions of the input but also the label transition feature functions, and various gradient-based optimization methods (SGD, quasi-Newton methods, etc.) are used to optimize the model parameters during training. For an input sequence, the trained model predicts the optimal label sequence that maximizes the objective function, which can be obtained by decoding with the Viterbi algorithm. An advantage of the CRF is therefore that it can use rich internal and contextual feature information when labeling a position.
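The Viterbi decoding mentioned above can be sketched in a few lines. The example assumes the model has already produced a per-position score for each label (`emissions`) and a label-to-label transition score matrix (`transitions`); both names and the toy scores are assumptions for illustration.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path and its score.

    emissions[t][y]   : score of label y at position t
    transitions[p][y] : score of moving from label p to label y
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n_labels):
            # Best previous label for ending at label y in position t.
            best_prev = max(range(n_labels),
                            key=lambda p: score[p] + transitions[p][y])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
        score = new_score
        backptr.append(pointers)
    best_last = max(range(n_labels), key=lambda y: score[y])
    path = [best_last]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[best_last]
```

With labels O=0, B=1, I=2 and a strongly negative O-to-I transition score, a position-by-position greedy choice could produce the invalid sequence O, I, O, whereas the decoder above returns a globally consistent path such as B, I, O. This is the practical benefit of including the transition features in the objective.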

The sequence granularity, the encoding initialization and the encoding method under the CRF framework are the main factors affecting the accuracy and correctness of the named entity recognition model. Therefore, for graph entity recall, the invention makes the following choices:

First, in terms of sequence granularity, because the words or phrases the invention aims to recall should be highly extensible, and because the task is sensitive to the uncertainty introduced by words and phrases as well as to possible multilingual environments and some special labels, the invention selects a word-granularity sequence representation.

In terms of encoding initialization, recall for the knowledge graph focuses on more general word meanings and needs a more complex encoding structure to store information. Therefore either a general fixed word-vector encoding (Word2Vec, FastText, dimension 100d) or a variable vector representation can be selected; for example, the BERT model provides high-quality 768d pre-trained word vectors with excellent semantic pre-training characteristics and character coverage.

In terms of the encoding algorithm, the knowledge graph recall task requires a high recall rate, can run offline and has low timeliness requirements, so the invention uses two deep network models, BiLSTM and IDCNN, to extract word, sequence, word segmentation and even part-of-speech information more completely. In addition, the invention uses the Adam algorithm with a variable learning rate as the gradient descent algorithm to optimize the neural network and the CRF state transition matrix simultaneously.

In summary, the invention handles the three phases of the named entity recognition model, namely distributed representation of the input, semantic encoding and label decoding, as follows. The distributed representation stage uses a word-granularity sequence and converts the input sentence into a word vector representation using a model that provides fixed word-vector encoding (e.g., Word2Vec or FastText, dimension 100d) or variable word-vector encoding (e.g., the BERT model, dimension 768d). In the semantic encoding stage, because the model recalls entities for the knowledge graph, which requires a high recall rate, can run offline and has low timeliness requirements, the invention uses the two deep network models BiLSTM and IDCNN to extract word, sequence, word segmentation and even part-of-speech information from the input word vector representation, thereby converting it into a context-dependent representation. In the label decoding stage, a conditional random field (CRF) algorithm takes the context-dependent representation as input and predicts the label sequence for the whole input. The neural network and the CRF state transition matrix are optimized simultaneously, using the Adam algorithm with a variable learning rate as the gradient descent algorithm.

The invention divides the labeled and adjusted corpus into two parts: a small part is used as model training data and the rest as entity prediction data. The entity recall model is trained on the training data, and the model and the training data set are trained and optimized through repeated model-data iterations until the entity recall model meets the requirements, yielding the desired entity recall model.

Entity prediction is then performed on the prediction data using the trained entity recall model to obtain a candidate entity list. The trained model predicts on the prediction data at the sentence scale, and the candidate entities obtained are nouns or noun phrases.

Step S23, filtering the candidate entities to obtain an entity set. During filtering, nouns from irrelevant domains, such as names of people, places, books, games and celestial bodies, are identified among the candidate entities and deleted from the candidate entity set. Similarly, pure numbers (e.g., 110, 119, etc.), pure symbols (e.g., @, rah, etc.) and candidate entities containing illegal symbols (e.g., C in its own copy) are deleted. In addition, a candidate entity must be neither too long nor too short: for example, candidate entities whose length falls outside the range of 2 to 10 characters are deleted. If the knowledge graph already contains entities, the candidates are compared with the existing entities and duplicates are removed. In a preferred embodiment, after the filtering operation, the remaining candidate entities are searched by title in the encyclopedia corpus, and candidates with no search result or with ambiguous results are screened out, to ensure the authenticity and generality of the entities.
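The rule-based part of this filtering can be sketched as below. Several points are assumptions made for the example: the `irrelevant` set stands in for whatever recognizer identifies person, place, book, game and celestial-body names; the set of allowed symbols is hypothetical; and the encyclopedia title search is not covered.

```python
def filter_candidates(candidates, existing=(), irrelevant=(),
                      allowed_symbols="+#/-. ", min_len=2, max_len=10):
    """Apply the numeric, symbol, length and duplicate rules to candidates."""
    seen = set(existing)          # entities already in the graph
    irrelevant = set(irrelevant)  # placeholder for out-of-domain names
    kept = []
    for cand in candidates:
        if cand in irrelevant or cand in seen:
            continue
        if cand.isdigit():                          # pure numbers
            continue
        if not any(ch.isalnum() for ch in cand):    # pure symbols
            continue
        if any(not ch.isalnum() and ch not in allowed_symbols for ch in cand):
            continue                                # contains an illegal symbol
        if not (min_len <= len(cand) <= max_len):   # outside 2 to 10 characters
            continue
        seen.add(cand)
        kept.append(cand)
    return kept
```

Note that `str.isalnum` is true for Chinese characters as well as for Latin letters and digits, so the same checks apply to a Chinese corpus.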

In a preferred embodiment, the method further includes step S24, re-screening the entities in the obtained entity set using classification tables. In one embodiment, a classification table is, for example, a job classification table of the recruitment platform, such as an industry classification table or a function classification table. Each classification table includes several major classes, each containing minor classes; for example, the industry classification table includes "computer/internet/communication/electronics" and "accounting/finance/banking/insurance", and "computer/internet/communication/electronics" in turn comprises subclasses such as "computer software", "computer hardware" and "network games". To serve job search and matching well when the graph is applied, the entities in the graph should match the job classification tables used in practice; therefore, this step determines whether each entity obtained in step S23 belongs to a class in a classification table. For example, the entity "Java" can be matched to "Java development engineer" in the job classification table, which belongs to the "computer/internet/communication/electronics" industry, so it is a valid candidate entity. The entity "USB flash drive" is not matched to a corresponding job class or directly to an industry class, but because it is a kind of computer hardware it can be matched to the "computer hardware" subclass in the industry classification table. Furthermore, so that entities in the graph are generally known and accepted and provide the business with inference information that is professional without lacking common sense, entities in the knowledge graph should be sufficiently popular in the corpus without being common words that appear in every document.
When an entity is determined to be a valid candidate entity, such as "Java" or "USB flash drive", it is searched for in the candidate entity set and the number of its occurrences that conform to the classification table is counted. For example, when the number of occurrences of "Java" conforming to the function classification table is greater than the count threshold, "Java" is confirmed as an entity and added to the entity set. For "USB flash drive", the number of occurrences conforming to the function classification table is below the threshold, so it is confirmed not to be an entity in the sense of the invention and is not added to the entity set.
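A rough sketch of this re-screening is given below. The matching rule (substring overlap between an entity and a class keyword), the keyword lists and the threshold value are all assumptions made for the example; the text does not specify how an entity is matched to a class.

```python
from collections import Counter


def screen_by_classification(candidate_mentions, classification, threshold):
    """Keep entities that match a class and occur more often than the threshold.

    candidate_mentions : one string per recalled mention (repeats allowed)
    classification     : class name -> keywords (hypothetical table layout)
    """
    def matched_class(entity):
        for name, keywords in classification.items():
            if any(entity in kw or kw in entity for kw in keywords):
                return name
        return None

    counts = Counter(m for m in candidate_mentions
                     if matched_class(m) is not None)
    return {entity: matched_class(entity)
            for entity, n in counts.items() if n > threshold}
```

In this toy setup, an entity such as "java" that matches a class and appears often is kept, while "usb flash drive" matches a class but is dropped when it appears too rarely, mirroring the two examples in the text.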

Step S25, labeling classification dimensions for the entities. To determine the association relationships between entities, the invention sets classification dimensions for the entities according to the aspects that need to be considered in the recruitment platform. The classification dimensions are, for example, industry, function, skill, language, educational background, position level and work type, and through these dimensions an entity can express information from one or more aspects. Each entity has one or more classification dimensions; for example, the entity "entertainment management" carries both industry and function information.

In step S3, the entity relationships in the present invention are divided into two categories according to the needs of job recommendation: the similarity relationship (is_similarity) and the containment relationship (is_included). A containment relationship between two entities is directional. It is either forward-included (forward_included), meaning that the latter entity contains the former entity, or backward-included (backward_included), meaning that the former entity contains the latter entity; the direction of the relationship between the two entities is determined accordingly. The data used to extract entity relationships is the uncleaned encyclopedia (WIKI) data, an encyclopedic knowledge base consisting of encyclopedia entries crawled from the Internet. The sources include encyclopedia, interactive encyclopedia, and Chinese Wikipedia, and the crawled content includes entry titles, first-paragraph summaries, entry labels, and inter-entry links. The specific extraction process is shown in fig. 5 and includes the following steps:

In step S31, the entities in the entity set are marked in the encyclopedia data. In one embodiment, the entities are marked in the encyclopedia data using position identifiers.
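
One possible way to mark entities with position identifiers is sketched below. The marker strings `<e>` and `</e>` and the function name `mark_entities` are assumptions for illustration; the source does not specify the form of the identifiers.

```python
import re


def mark_entities(sentence, entities, open_tag="<e>", close_tag="</e>"):
    """Wrap every entity mention in position identifiers (hypothetical markers)."""
    if not entities:
        return sentence
    # Longest entities first, so a longer entity wins over a substring of it.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in ordered))
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", sentence)


marked = mark_entities(
    "Maotai liquor is a kind of white spirit.",
    ["white spirit", "Maotai liquor"],
)
print(marked)
```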

In step S32, sentences having two entities are screened out using a specific rule. The specific rule is, for example: the sentence contains a word expressing an inclusion relationship, such as "including" or "containing", together with two or more entities in a parallel relationship; or, within the sentence, two entities are connected by words such as "is", "i.e.", or "is also called". Screening the corpus according to step S32 yields a relationship extraction data set, in which each piece of data includes entity one, entity two, and the sentence containing entity one and entity two. Examples are shown in the following Table-1:

TABLE-1
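
The rule-based screening of step S32 can be sketched as below. The cue word lists, the record field names, and the simplification that a cue word may appear anywhere in the sentence (rather than strictly between the two entities) are assumptions of this sketch.

```python
from itertools import combinations

# Cue words taken from the examples in the text; the exact lists are assumptions.
INCLUSION_CUES = ("including", "containing", "includes", "contains")
SIMILARITY_CUES = (" is ", " i.e. ", " is also called ")


def build_relation_dataset(sentences, entities):
    """Keep sentences with a cue word and at least two entities; emit entity pairs."""
    records = []
    for sentence in sentences:
        present = sorted((e for e in entities if e in sentence), key=sentence.index)
        has_cue = any(cue in sentence for cue in INCLUSION_CUES + SIMILARITY_CUES)
        if len(present) >= 2 and has_cue:
            for first, second in combinations(present, 2):
                records.append(
                    {"entity_one": first, "entity_two": second, "sentence": sentence}
                )
    return records


sentences = [
    "Common white spirit varieties, including Maotai liquor, are sold nationwide.",
    "Natural gas arrives by pipeline.",
]
entities = ["Maotai liquor", "white spirit", "Natural gas", "pipeline"]
dataset = build_relation_dataset(sentences, entities)
print(dataset)
```

Only the first sentence is kept: the second contains two entities but no cue word.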

Step S33, predicting the relationship between the two entities in each sentence of the relationship extraction data set by using the relationship extraction model. In this embodiment, a supervised-classification deep learning algorithm is used to solve the problem of extracting relationships between entities. Position identifiers represent the positions of the two entities; a BiGRU model (a variant of the RNN model) maps the training sentences to word-level vectors; a single-head attention mechanism is introduced to adjust the weights of the word-level sequence vectors; and the relationship extraction problem is thereby converted into a sequence classification problem.

A small portion of the data, for example 10000 sentences each including two entities, is taken from the data set screened in step S32. These sentences are converted into the training corpus format:

x (entity one), y (entity two), relation, sentence containing the entities and the relation.

The entity relation is an inclusion relation or a similarity relation; when two entities co-occur but neither an inclusion nor a similarity relation exists between them, the relation is marked as unknown.

The quality of the training data and the entity relationship extraction model are optimized continuously by a model-training data mutual iteration method until the entity relationship extraction model meets the requirements.

The relationship extraction model is then used to predict, sentence by sentence, the relationship between the two entities in each sentence of the remaining data screened in step S32.

In a sentence, entity one and entity two have an inclusion relationship or a similarity relationship in a common classification dimension. If the relationship is an inclusion relationship, it is further determined whether it is forward-included or backward-included; if neither an inclusion nor a similarity relationship can currently be determined, the relationship is tentatively set to unknown. Examples are shown in the following Table-2:

TABLE-2

Entity one	Entity two	Entity relationship
White spirit	Maotai liquor	Contains (backward_included)
Bean curd jelly	Jellied bean curd	Similar
Natural gas	Pipeline	Unknown
Computer	Office equipment	Contains (forward_included)
app	Mobile phone software	Similar
Liver disease	Hepatitis (HAV)	Contains (backward_included)

Two entities in a containment relationship are given a direction according to whether the relationship is forward or backward: the node at which the direction starts is the parent node, the node at which it arrives is the child node, and the parent node contains the child node. In the example of Table-2, "white spirit" backward-includes "Maotai liquor", so "white spirit" is the parent node, "Maotai liquor" is the child node, and the direction runs from "white spirit" to "Maotai liquor". Two nodes in a similarity relationship have no direction between them.
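
The direction rule can be illustrated with a short sketch. The relation label strings and the tuple layout `(parent, child, kind)` are assumptions chosen for the example.

```python
def to_edge(entity_one, entity_two, relation):
    """Turn a predicted relation into an edge; contains-edges run parent -> child."""
    if relation == "backward_included":  # entity one contains entity two
        return (entity_one, entity_two, "contains")
    if relation == "forward_included":  # entity two contains entity one
        return (entity_two, entity_one, "contains")
    if relation == "is_similarity":  # undirected, so store in a canonical order
        first, second = sorted((entity_one, entity_two))
        return (first, second, "similar")
    return None  # unknown relation: no edge is created


print(to_edge("white spirit", "Maotai liquor", "backward_included"))
print(to_edge("computer", "office equipment", "forward_included"))
```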

The knowledge graph is established from the entities and the relationships between them obtained in the foregoing steps S2 and S3, with each entity as a node and each relationship between entities as an edge. The knowledge graph library comprises a plurality of associated nodes. Each node comprises a node label and one or more corresponding classification dimensions (or attributes), and nodes are connected, attribute by attribute, to the nodes with which they have a mapping relationship such as an inclusion or similarity relationship. For a given attribute, each knowledge node can be connected to an upper-level node or a lower-level node according to the inclusion relationship, so the mapping relationship of one attribute forms a multi-level chain. Along this chain, the meaning expressed by the nodes goes from abstract to concrete, starting from the root node.
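
A minimal in-memory sketch of such a graph is given below. The class and method names are hypothetical, the node "programming language" is invented for the example, and the sketch assumes at most one parent per node and attribute, which the source does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    dimensions: set
    parents: dict = field(default_factory=dict)  # dimension -> parent label
    similar: set = field(default_factory=set)


class JobKnowledgeGraph:
    def __init__(self):
        self.nodes = {}

    def add_node(self, label, dimensions):
        self.nodes.setdefault(label, Node(label, set(dimensions)))

    def add_contains(self, parent, child, dimension):
        # Directed edge: the parent node contains the child node.
        self.nodes[child].parents[dimension] = parent

    def add_similar(self, first, second):
        # Undirected edge, stored on both nodes.
        self.nodes[first].similar.add(second)
        self.nodes[second].similar.add(first)

    def chain_to_root(self, label, dimension):
        """Follow contains-edges upward: concrete node first, root node last."""
        chain, seen = [label], {label}
        while True:
            parent = self.nodes[chain[-1]].parents.get(dimension)
            if parent is None or parent in seen:  # root reached or cycle guard
                return chain
            chain.append(parent)
            seen.add(parent)


graph = JobKnowledgeGraph()
for name in ("programming language", "Java", "Hibernate"):
    graph.add_node(name, {"skill"})
graph.add_contains("programming language", "Java", "skill")
graph.add_contains("Java", "Hibernate", "skill")
chain = graph.chain_to_root("Hibernate", "skill")
print(chain)
```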

In one embodiment, the mapping relationships between entities are stored as a configuration file. To facilitate query and matching operations, sequence numbers are assigned to the entities and to the classification dimensions, so that search efficiency is improved through the serialization of the entities.

For the graph to meet growing business demands, it needs to be maintained in real time or periodically. That is, when new corpus data is added, new entities are recalled from the newly added corpus according to step S2, and entity relationships are then extracted. To establish relationships with the entities already in the original graph, in one embodiment the new entities are marked in the corpus according to step S31, and new sentences having two entities, at least one of which is a new entity, are obtained according to step S32 to form a new relationship extraction data set. The relationships among the new entities and between the new entities and the existing entities are obtained from the new relationship extraction data set according to step S33, and the new entities and new relationships are added to the original graph.
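
The maintenance condition, that at least one entity of each retained sentence is new, can be sketched as a simple filter. The record field names and the example entity "Kotlin" are assumptions for illustration.

```python
def new_relation_records(records, new_entities):
    """Keep only records in which at least one of the two entities is new."""
    new = set(new_entities)
    return [r for r in records if r["entity_one"] in new or r["entity_two"] in new]


records = [
    {"entity_one": "Java", "entity_two": "Hibernate", "sentence": "..."},
    {"entity_one": "Java", "entity_two": "Kotlin", "sentence": "..."},
]
fresh = new_relation_records(records, new_entities=["Kotlin"])
print(fresh)
```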

FIG. 6 is a functional block diagram of the job knowledge graph generation system according to one embodiment of the present invention. In this embodiment, the job knowledge graph generation system includes a corpus module 1, an entity recall module 2, a relationship extraction module 3, and a map generation module 4. The corpus module 1 is configured to establish a corpus with different data states based on the job description data set, the personal resume data set, and the encyclopedia knowledge data set. The raw corpus data of this embodiment comes from three types of data: the job description data set and the personal resume data set from the platform database, and the encyclopedic knowledge data set obtained through the web crawler module 6.

As shown in fig. 7, the corpus module 1 includes at least a data cleaning unit 11 and a data merging and splitting unit 12. The data cleaning unit 11 screens and cleans the three original data sets, which includes: deleting unnecessary data, such as telephone numbers, mailboxes, or addresses; normalizing the data format; and counting the contents of entry fields and deleting entries that are too long, too short, overflowing, or obviously unreasonable. The data merging and splitting unit 12 merges and splits the three cleaned data sets respectively. In one embodiment, the data is divided into four states according to the requirements of entity recall and relationship extraction: the original data set; the cleaned data set; the data set de-duplicated at entry and paragraph scale; and the prediction data set, in units of sentences, formed by further sentence splitting, de-duplication, and sorting on the basis of paragraphs. Each data set comprises data in these four states to meet subsequent use requirements.

As shown in fig. 8, the entity recall module 2 includes a first data preparation unit 21, an entity obtaining unit 22, and an entity filtering unit 23. In one embodiment, the first data preparation unit 21 screens out, from the three data sets in the corpus and according to the flow shown in fig. 3, the data sets that can be used for entity recall, including a prediction data set on which the entity recall model performs entity prediction. The entity obtaining unit 22 uses the named entity recognition model as the entity recall model and performs entity prediction on the prediction data set to obtain candidate entities. The entity filtering unit 23 is connected to the entity obtaining unit 22 and filters the candidate entities according to filtering rules to obtain a candidate entity list (also referred to as a candidate entity set) including a plurality of candidate entities. The filtering rules include: filtering out nouns or noun phrases from other fields, such as personal names and place names; filtering out pure numbers, pure symbols, and the like; and filtering out nouns or noun phrases with fewer than two words or more than ten words. In addition, when the knowledge graph already exists and the currently obtained entities come from new corpus data, they are compared with the existing entities so that entities already in the graph are filtered out.
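
The filtering rules of the entity filtering unit 23 can be sketched as follows. The function name, the blocked-term set, and the choice to measure length in characters are assumptions; the unit of length in the source is ambiguous in translation.

```python
def filter_candidates(candidates, blocked_terms, existing_entities, min_len=2, max_len=10):
    """Apply the filtering rules in order and return the surviving candidates."""
    kept = []
    seen = set(existing_entities)
    for candidate in candidates:
        text = candidate.strip()
        if text in blocked_terms:  # other-domain nouns, e.g. personal or place names
            continue
        if not any(ch.isalpha() for ch in text):  # pure numbers or pure symbols
            continue
        if not (min_len <= len(text) <= max_len):  # length measured in characters
            continue
        if text in seen:  # already in the graph, or a duplicate candidate
            continue
        seen.add(text)
        kept.append(text)
    return kept


candidate_list = filter_candidates(
    ["Java", "2021", "###", "Shanghai", "C", "Hibernate", "Java"],
    blocked_terms={"Shanghai"},
    existing_entities={"Hibernate"},
)
print(candidate_list)
```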

To better adapt the obtained entities to job search requirements, in a preferred embodiment the entity recall module 2 further includes an entity screening unit 24 configured to screen the entities in the obtained entity set by using the job classification table, screening out entities that do not conform to the job classification table and/or that conform to it but appear in the corpus fewer times than a threshold. The job classification table is, for example, a job database in which all job data of the platform is stored. The entity screening unit 24 searches the job database with each entity in the candidate entity list as the search target. If the candidate entity is found and the number of hits is greater than a certain threshold, for example 10, the candidate entity is retained; otherwise the candidate entity is considered to have no practical significance and is deleted from the candidate entity list. The entity set of the graph is obtained after this screening. To determine the relationship of an entity to other entities, the classification dimensions of each entity also need to be determined. In one embodiment, a classification dimension configuration file is stored in the system based on the characteristics of jobs; it defines the various angles of interest for a job, such as industry, skill, function, job level, educational background, and job type. Each entity has at least one classification dimension. Therefore, the entity recall module 2 further includes a classification dimension labeling unit 25. After the entity screening unit 24 finishes screening, it sends a notification to the classification dimension labeling unit 25, which labels the classification dimensions of the entities in the entity set.

The invention can use an existing trained named entity recognition model as the entity recall model to recall entities. When no trained named entity recognition model is available, the entity recall module 2 further includes an entity recall model unit 26, which separates a certain amount of data from the data set as a training data set for the model. Based on the training data set, the entity recall model unit 26 trains and optimizes the entity recall model and the training data set in a model-data iterative manner until the entity recall model meets the requirements. The entity recall model and its training are described in the foregoing method and are not repeated here.

FIG. 9 is a functional block diagram of the relationship extraction module according to one embodiment of the present invention. The relationship extraction module 3 is connected with the entity recall module 2 and the corpus module 1, and is configured to extract relationships between entities from the encyclopedic knowledge data set through a relationship extraction model, wherein the relationship between entities is a containment or similarity relationship. The relationship extraction module 3 comprises an entity labeling unit 31, a second data preparation unit 32, and a relationship extraction unit 33. The entity labeling unit 31 is connected to the entity recall module 2 and marks the entities of the entity set in the encyclopedia knowledge data set in the corpus 10. The second data preparation unit 32 screens out sentences having two entities from the encyclopedic knowledge data set using a specific rule to constitute a relationship extraction data set. When the system is in the graph maintenance phase, the entity recall module 2 recalls new entities, so the entity labeling unit 31 also marks the new entities in the encyclopedic knowledge data set, and the second data preparation unit 32 screens out new sentences from the encyclopedic knowledge data set as a new relationship extraction data set, wherein at least one of the two entities in each new sentence is a new entity.

The relationship extraction unit 33 is connected to the second data preparation unit 32 and configured to predict the relationship between two entities from the relationship extraction data set using a relationship extraction model. If no trained model is available, the relationship extraction module 3 further includes a corpus generating unit 34 and a relationship extraction model unit 35. The corpus generating unit 34 extracts a preset number of sentences from the relationship extraction data set and converts them into the training corpus format. Based on the training corpus, the relationship extraction model unit 35 trains and optimizes the relationship extraction model and the training corpus in a model-data iterative manner until the relationship extraction model meets the requirements. When the system is in the graph maintenance phase, the relationship extraction unit 33 extracts, from the new relationship extraction data set, the relationships between two new entities and the relationships between new entities and existing entities. After the new relationships are obtained, the map generation module 4 is notified to add the new entities and the new entity relationships to the original graph. Through periodic or real-time maintenance, the graph can comprise more and more knowledge nodes and mutual relations, so that the coverage can be realized

The map generation module 4 takes the entities as nodes and the relationship between the entities as a connection basis, thereby establishing the knowledge map. Fig. 10 is a schematic diagram illustrating connection of some nodes in the graph. The map can be presented in the form of a table or text, and can also be presented in the form of a graph by using a visual interface.

Application examples

A large amount of information is gathered on the recruitment platform. If job seekers and recruiters rely only on manual search, finding a suitable position or candidate in that information is time-consuming and difficult. Therefore, to increase the success rate of job hunting and recruitment on the platform and to help job seekers and recruiters work more efficiently, the recruitment platform can proactively recommend positions to job seekers or candidates to recruiters. In the present embodiment, the job seeker is the target recommendation user. Generally, a job seeker uploads a resume to the recruitment platform, or fills in a resume according to the format required by the platform, and performs operations such as searching and viewing on the platform. Similarly, a recruiter submits recruitment information to the platform, or fills it in according to the format required by the platform, and searches and views job-seeking information. The platform obtains the demand information of the job seeker from the uploaded or filled-in resume, and the demand information of the recruiter from the uploaded or filled-in recruitment information. In this embodiment, recruiters are recommended to the job seeker according to the respective demand information of the recruiters and the job seeker and their behavior data, such as searching and viewing, on the recruitment platform. The recommendation flow is shown in fig. 11.

Step S1a, obtaining a first tag of a job seeker and a second tag of a recruiter. The resume of the job seeker and the recruitment information of the recruiter are used as their respective requirement information, and the first tag of the job seeker and the second tag of the recruiter are obtained from that requirement information. The process of acquiring the first tag is shown in fig. 12:

Step S11a, extracting a plurality of keywords from the requirement information of the target recommendation user and obtaining the corresponding semantic tags. For example, all the text content in the resume of the job seeker is read and semantically recognized to obtain a plurality of keywords, such as "Java engineer", "software development", "proficient Java", and "C++ development"; or "language teacher", "teaching language at the elementary school stage", "language tutoring", and "part-time composition teaching"; or "clothing sales" and "women's clothing shopping guide". The system is provided with a prefix word list and a suffix word list. Each prefix word and each suffix word has a corresponding standard word, and the semantic tags are formed by replacing the prefix words and suffix words with their corresponding standard words.
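
The replacement of prefix and suffix words by standard words can be sketched as below. The two word lists and the function name `to_semantic_tag` are hypothetical, and the sketch assumes single-token prefix and suffix words separated by whitespace.

```python
# Hypothetical word lists: each variant maps to its standard word.
PREFIX_STANDARD = {"java": "Java", "j2ee": "Java"}
SUFFIX_STANDARD = {"programmer": "engineer", "developer": "engineer"}


def to_semantic_tag(keyword):
    """Replace a known prefix word and suffix word with their standard words."""
    tokens = keyword.strip().split()
    if not tokens:
        return ""
    first = tokens[0].lower()
    if first in PREFIX_STANDARD:
        tokens[0] = PREFIX_STANDARD[first]
    last = tokens[-1].lower()
    if last in SUFFIX_STANDARD:
        tokens[-1] = SUFFIX_STANDARD[last]
    return " ".join(tokens)


print(to_semantic_tag("J2EE developer"))
print(to_semantic_tag("clothing sales"))
```

Keywords without a listed prefix or suffix word, such as "clothing sales", pass through unchanged.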

Step S12a, matching corresponding knowledge nodes for each semantic tag by using the knowledge graph library. A plurality of nodes is determined for each semantic tag by using the knowledge graph library. For example, by inputting the keyword "Hibernate development" into the knowledge graph library, the following can be obtained: an "engineer" node of the function attribute; a "Hibernate" node and a "Java" node of the skill attribute; and a "software" node of the industry attribute.

Step S13a, generating one or more first labels from the matched node or nodes. Specifically, a label consists of a prefix word and a suffix word, where the attribute of a prefix word is industry or skill and the attribute of a suffix word is function. Nodes whose attribute is industry, skill, or the like are therefore combined pairwise with nodes whose attribute is function, each combination yielding one label. For example, when a function node "engineer", a skill node "Java", and an industry node "software" are obtained from the keyword "Java development", two labels are obtained: {occupation direction: "software engineer"} and {occupation direction: "Java engineer"}. Because the two labels are of the same kind, the more accurate label {occupation direction: "Java engineer"} is kept. As another example, combining and integrating the nodes matched by the semantic tags of the keywords "software development" and "proficient C++" yields the label {occupation direction: "C++ engineer"}.
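
The pairwise combination of step S13a can be sketched as follows. The source does not say how the "more accurate" label is chosen; the sketch assumes, purely for illustration, that a skill prefix is more specific than an industry prefix. All names are hypothetical.

```python
from itertools import product

# Assumption: a skill prefix is treated as more specific than an industry prefix.
PREFIX_SPECIFICITY = {"skill": 2, "industry": 1}


def combine_labels(nodes):
    """Pair every industry/skill node with every function node."""
    prefixes = [(label, attr) for label, attr in nodes if attr in PREFIX_SPECIFICITY]
    suffixes = [label for label, attr in nodes if attr == "function"]
    return [
        (f"{prefix} {suffix}", PREFIX_SPECIFICITY[attr])
        for (prefix, attr), suffix in product(prefixes, suffixes)
    ]


def most_specific(candidates):
    """Among labels of the same kind, keep the one with the highest specificity."""
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1])[0]


nodes = [("engineer", "function"), ("Java", "skill"), ("software", "industry")]
candidates = combine_labels(nodes)
occupation_direction = most_specific(candidates)
print(candidates, occupation_direction)
```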

A plurality of second tags is obtained for each recruiter in the same way.

Step S2a, filtering the plurality of recruiters. The first tag of the target recommendation user (a specific job seeker) is matched against the second tags of the plurality of recruiters, and recruiters whose second tags do not match the first tag are filtered out, so that the remaining recruiters are those consistent with the first tag of the target recommendation user.

Step S3a, sorting the plurality of recruiters. In one embodiment, the dimensions correspond to the label types, and the plurality of recruiters is sorted with each label type as a dimension to obtain a sorting value $v_i$. For example, suppose 20 recruiters remain after filtering. For the "pay" dimension, the value in the label of category "pay" is used as the base value and compared with the pay offered by each of the 20 recruiters; the 20 recruiters are sorted by the difference from small to large, and each recruiter obtains a sorting value $v_m$ in the "pay" dimension. For the "distance from home (D)" dimension, the address of the job seeker is used as the coordinate, the distance D from each recruiter to that address is calculated, the 20 recruiters are sorted by D from small to large, and each recruiter obtains a sorting value $v_D$ in this dimension. The final ranking $V$ of each recruiter is calculated from the weight of the target recommendation user in each dimension and the sorting value of the recruiter in each dimension, namely

$$V = \sum_{i} q_i v_i$$

where $v_i$ is the sorting value in the i-th dimension and $q_i$ is the weight of the target recommendation user in the i-th dimension.
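
The ranking of step S3a can be sketched as below, assuming that the final ranking is the weighted sum of the per-dimension sorting values and that a smaller value means a better match. The function names, recruiter names, scores, and weights are invented for the example.

```python
def rank_values(scores):
    """Map each recruiter to a 1-based rank, smallest score (difference) first."""
    ordered = sorted(scores, key=scores.get)
    return {name: position + 1 for position, name in enumerate(ordered)}


def final_ranking(per_dimension_scores, weights):
    """Combine per-dimension ranks into a weighted total (assumed: a weighted sum)."""
    ranks = {dim: rank_values(scores) for dim, scores in per_dimension_scores.items()}
    recruiters = next(iter(per_dimension_scores.values())).keys()
    totals = {r: sum(weights[d] * ranks[d][r] for d in ranks) for r in recruiters}
    return sorted(totals, key=totals.get), totals


scores = {
    "pay": {"A": 500, "B": 2000, "C": 0},  # difference from the job seeker's base pay
    "distance": {"A": 3.0, "B": 1.0, "C": 12.0},  # distance D from the job seeker
}
weights = {"pay": 0.7, "distance": 0.3}  # the target user's weight per dimension
order, totals = final_ranking(scores, weights)
print(order)
```

With these weights, recruiter C ranks first because the "pay" dimension dominates, even though C is the farthest away.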

Step S4a, generating recommendation information from a preset number of top-ranked recruiters and pushing it to the target recommendation user. The recommendation information includes the name of the recruiting company, the name of the recruited position, a link to the recruitment information page published by the recruiter, the work location of the position, the salary range of the position, and the like. The information is then pushed to the job seeker through a pop-up window or by mail.

In this embodiment, after keywords are obtained from the user demand information, they are matched against the knowledge graph to obtain labels that are close to the keyword information but of different types. Implicit information in the user demand information can thus be mined, which improves recommendation precision and the match between the recommended positions and the user's demand, saves the time of job seekers and/or recruiters, and increases the job-hunting success rate of job seekers and the recruitment success rate of recruiters.

The above embodiments are provided only for illustrating the present invention and not for limiting the present invention, and those skilled in the art can make various changes and modifications without departing from the scope of the present invention, and therefore, all equivalent technical solutions should fall within the scope of the present invention.

Claims

1. A position knowledge graph generation method comprises the following steps:

establishing a corpus based on the position description data set, the personal resume data set and the encyclopedia knowledge data set;

recalling entities from the corpus through an entity recall model to obtain map entities, wherein the entities are nouns or noun phrases;

extracting relationships between entities from an encyclopedic knowledge data set through a relationship extraction model, wherein the relationships between the entities are contained or similar; and

establishing a mapping relation between the entities according to the relation between the entities to generate the position knowledge graph.

2. The method of claim 1, further comprising the step of preparing a dataset for entity recall based on a corpus:

extracting a primary screening data set from the corpus by using the characteristic sentence pattern and the specific dictionary;

labeling the primary screening data set; and

adjusting the corpus to obtain an entity recall data set.

3. The method of claim 2, wherein the specific dictionary comprises an entity dictionary and a label dictionary, and the step of extracting the primary screening data set from the corpus using the characteristic sentence pattern and the specific dictionary comprises:

extracting sentences which accord with the characteristic sentence patterns from the corpus by utilizing the characteristic sentence patterns to form a first primary selection data set;

extracting sentences including entities in the entity dictionary from the corpus based on the entity dictionary to form a second primary selection data set;

splitting the labels in the label dictionary to obtain a label morpheme set;

mapping the morphemes in the tagged morpheme set to a corpus, and extracting sentences comprising the tagged morphemes from the corpus;

extracting new words from sentences containing tagged morphemes;

extracting sentences containing new words from the corpus to form a third primary selection data set; and

merging the first primary selection data set, the second primary selection data set and the third primary selection data set to serve as the primary screening data set.

4. The method of claim 2, wherein labeling the primary screening data set comprises labeling the nouns or noun phrases in each sentence.

5. The method of claim 2, further comprising: separating a preset amount of data from the entity recall data set to be used as a training data set, wherein the rest data are prediction data sets; the method further comprises:

constructing an entity recall model;

training and optimizing the entity recall model and the training data set in a model-data iteration mode until the entity recall model meets the requirements;

performing entity prediction based on the prediction dataset using the entity recall model to obtain candidate entities; and

filtering the candidate entities according to a filtering rule to obtain an entity set comprising a plurality of entities.

6. The method of claim 5, further comprising: screening the entities in the obtained entity set by using the position classification table, so as to screen out entities that do not conform to the position classification table and/or that conform to the position classification table but appear in the corpus fewer times than a threshold.

7. The method of claim 6, further comprising: labeling classification dimensions for the entities in the entity set.

8. The method of claim 1, further comprising:

marking out entities in the encyclopedic knowledge data set;

screening out sentences with two entities from the encyclopedic knowledge data set by using a specific rule to form a relationship extraction data set, wherein the specific rule is used for expressing that the two entities have a containing or similar relationship; and

a relationship between two entities is predicted from the relationship extraction dataset using a relationship extraction model.

9. The method of claim 8, further comprising:

extracting a preset number of sentences from the relation extraction data set to serve as a training data set;

converting sentences in the training data set into a training corpus format; the format of the training corpus is as follows: x, y, relation, statements containing entities and relations;

wherein x is entity one, y is entity two, and relationship is the entity relationship between entity one and entity two, which is an inclusion relationship, a similarity relationship or an unknown relationship; and

training and optimizing the relation extraction model and the training data set in a model-data mutual iteration mode until the relation extraction model meets the requirements.

10. The method according to claim 1, wherein when new corpus is obtained after the atlas is generated, the method further comprises the following steps:

recalling new entities from the new corpus through an entity recall model;

extracting the relationships between new entities and between the new entities and the existing entities from the encyclopedic knowledge data set through a relationship extraction model; and

adding, to the original map, the new entities and the mapping relations among the new entities and between the new entities and the existing entities.

11. A position knowledge graph generation system comprising:

a corpus module configured to establish a corpus based on a job description dataset, a personal resume dataset, and an encyclopedia knowledge dataset;

an entity recall module configured to recall entities from the corpus through an entity recall model to obtain map entities, wherein the entities are nouns or noun phrases;

a relationship extraction module, connected with the entity recall module, configured to extract relationships between entities from an encyclopedia knowledge dataset through a relationship extraction model, wherein the relationships between the entities are inclusive or similar; and

a map generation module, connected with the entity recall module and the relationship extraction module, configured to establish a mapping relation between the entities according to the relation between the entities so as to generate the position knowledge graph.

12. The system of claim 11, wherein the entity recall module comprises:

a first data preparation unit configured to prepare a data set for entity recall based on a corpus, including a predicted data set;

an entity acquisition unit configured to perform entity prediction based on the prediction dataset using an entity recall model to obtain candidate entities; and

an entity filtering unit, connected with the entity acquisition unit, configured to filter the candidate entities according to a filtering rule to obtain an entity set comprising a plurality of entities.

13. The system of claim 12, wherein the entity filtering unit further filters existing entities from candidate entities.

14. The system according to claim 12, further comprising an entity screening unit connected to the entity filtering unit and configured to screen the entities in the obtained entity set by using the position classification table, so as to screen out entities that do not conform to the position classification table and/or that conform to the position classification table but appear in the corpus fewer times than a threshold.

15. The system of claim 14, further comprising a classification dimension labeling unit configured to label a classification dimension for an entity in the set of entities.

16. The system of claim 12, wherein the data sets for entity recall further comprise a training data set, and the entity recall module further comprises an entity recall model unit configured to train and optimize the entity recall model and the training data set in a model-data iterative manner, based on the training data set, until the entity recall model meets requirements.
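
As a non-limiting illustration, a model-data iterative loop of the kind recited above might be sketched as follows. The callables `train`, `evaluate`, and `refine_data`, the numeric `target_score`, and the `max_rounds` cap are all assumptions introduced for this sketch; the claims do not specify how the model is evaluated or how the training data set is revised between rounds.

```python
def iterate_model_and_data(train, evaluate, refine_data, training_data,
                           target_score, max_rounds=10):
    """Alternate between training the model and revising the training data.

    train(data) -> model
    evaluate(model) -> float score (higher is better)
    refine_data(model, data) -> revised data, e.g. after correcting labels
    """
    model, score, rounds = None, float("-inf"), 0
    for rounds in range(1, max_rounds + 1):
        model = train(training_data)
        score = evaluate(model)
        if score >= target_score:
            break  # the model meets the requirement
        # Otherwise revise the data using the current model, then retrain.
        training_data = refine_data(model, training_data)
    return model, training_data, score, rounds


# Toy stand-ins: the "model" is just the data size, and it improves as data grows.
model, data, score, rounds = iterate_model_and_data(
    train=lambda d: len(d),
    evaluate=lambda m: m / 10,
    refine_data=lambda m, d: d + ["new example"],
    training_data=["example"] * 7,
    target_score=0.9,
)
```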

17. The system of claim 11, wherein the relationship extraction module comprises:

an entity annotation unit connected with the entity recall module and configured to annotate entities in an encyclopedia knowledge data set;

a second data preparation unit configured to select, from the encyclopedia knowledge data set, sentences containing two entities using a specific rule, so as to constitute a relationship extraction data set, wherein the specific rule expresses that the two entities have an inclusion relationship or a similarity relationship; and

a relationship extraction unit configured to predict a relationship between two entities from the relationship extraction data set using a relationship extraction model.
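
As a non-limiting illustration, the sentence selection performed by the second data preparation unit might be sketched as follows. The cue phrases in `CUE_PATTERNS` are hypothetical examples of a "specific rule" and are not taken from the claims; the sketch also assumes that a qualifying sentence contains exactly two annotated entities, found by naive substring matching.

```python
import re

# Hypothetical cue phrases suggesting an inclusion or similarity relationship.
CUE_PATTERNS = {
    "contains": re.compile(r"\b(includes?|such as|is a kind of)\b"),
    "similar": re.compile(r"\b(similar to|also known as)\b"),
}


def select_candidate_sentences(sentences, entities):
    """Return (sentence, first_entity, second_entity, rule_hint) tuples."""
    selected = []
    for sentence in sentences:
        # Entities present in the sentence, ordered by first occurrence.
        found = sorted((e for e in entities if e in sentence), key=sentence.index)
        if len(found) != 2:
            continue
        for rule_hint, pattern in CUE_PATTERNS.items():
            if pattern.search(sentence):
                selected.append((sentence, found[0], found[1], rule_hint))
                break
    return selected


candidates = select_candidate_sentences(
    [
        "Programming languages such as Python are widely used.",
        "Python is similar to Ruby.",
        "Python is popular.",
        "Python and Ruby were discussed.",
    ],
    ["Programming languages", "Python", "Ruby"],
)
```

The rule hint only marks why a sentence was selected; in the claimed system the relationship itself is predicted by the relationship extraction model.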

18. The system according to claim 17, wherein, when a new entity is obtained from a new corpus, at least one of the two entities in each sentence selected by the second data preparation unit is the new entity, and the relationship predicted by the relationship extraction unit is a relationship between two new entities or a relationship between a new entity and an old entity.

19. The system of claim 17, wherein the relationship extraction module further comprises:

a corpus generating unit, coupled to the second data preparation unit, configured to extract a preset number of sentences from the relationship extraction data set and convert them into a corpus format; and

a relationship extraction model unit configured to generate the relationship extraction model, and to train and optimize the relationship extraction model and the training corpora in a model-data iterative manner, based on the training corpora, until the relationship extraction model meets requirements.