CN117390297A - Large-scale talent intelligence library information optimization matching method - Google Patents

Info

Publication number
CN117390297A
Authority
CN
China
Prior art keywords
column vector
column
similarity
cluster
covariance matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311706817.6A
Other languages
Chinese (zh)
Other versions
CN117390297B (en)
Inventor
庞秋宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Tongde And Light Polytron Technologies Inc
Original Assignee
Tianjin Tongde And Light Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Tongde And Light Polytron Technologies Inc
Priority to CN202311706817.6A
Publication of CN117390297A
Application granted
Publication of CN117390297B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of information retrieval, and in particular to a large-scale talent intelligence library information optimization matching method, which comprises the following steps: obtaining each column vector corresponding to the talent intelligence library, and clustering all column vectors to obtain each cluster; determining the regularization necessity according to the modular length difference and the element distribution similarity between any two column vectors in a cluster, and then judging whether regularization is to be performed; if regularization is performed, constructing a column vector retention function to determine a maximum non-zero value; updating the initial covariance matrix according to the maximum non-zero value to obtain a regularized and optimized covariance matrix, and then performing PCA (principal component analysis) dimension reduction on each column vector corresponding to the talent intelligence library; and performing optimization matching based on the dimension reduction result. By quantifying the degree of similarity of the talent intelligence library data, the method improves the accuracy of the dimension reduction result and facilitates accurate optimization matching of talent intelligence library information.

Description

Large-scale talent intelligence library information optimization matching method
Technical Field
The invention relates to the technical field of information retrieval, and in particular to a large-scale talent intelligence library information optimization matching method.
Background
The talent intelligence library contains a large amount of individual, multidimensional information and requires efficient large-scale data processing techniques, such as distributed computing, stream processing, and efficient querying and indexing for large-scale databases. Different types of data require effective feature engineering to convert them into vector representations, which makes information matching easier when the vector data are analyzed. The purpose of information optimization matching is to improve efficiency, accuracy and personalization. Reasonable matching better meets the needs of different parties and thereby promotes development in fields such as recruitment, social networking and recommendation. To achieve better matching, the PCA (Principal Component Analysis) vector dimension reduction technique is used for analysis, which helps reduce dimensional complexity, remove redundant information and highlight the main features.
The covariance matrix generated in the process of performing the dimension reduction operation on the multidimensional vector by using the PCA algorithm is influenced by the data similarity or relevance under the data scene of the talent intelligent library, and has singularity. The matrix with singularities cannot be decomposed into complete non-zero eigenvalues and corresponding eigenvectors, so that numerical instability or errors occur in the process of calculating the eigenvectors later, deviation exists in vector dimension reduction processing results, and the accuracy of information optimization matching is low.
Disclosure of Invention
In order to solve the technical problem that information optimization matching accuracy is low because a singular covariance matrix cannot be decomposed into complete non-zero eigenvalues and corresponding eigenvectors during dimension reduction, the invention aims to provide a large-scale talent intelligence library information optimization matching method, which adopts the following specific technical scheme:
the embodiment of the invention provides a large-scale talent intelligent library information optimization matching method, which comprises the following steps:
acquiring information data of all talents in a talent intelligent library, and performing data preprocessing on the information data of all talents to acquire each column vector corresponding to the talent intelligent library; acquiring an initial covariance matrix according to each column vector;
selecting a column vector, screening out target elements according to the degree of dispersion of the elements in the column vector, and determining the dimension of the target elements as a target dimension; clustering all column vectors according to a preset clustering rule, based on the elements of the target dimension in each column vector, to obtain each cluster;
for any cluster, determining the similarity of all column vectors in the cluster according to the modular length difference and element distribution similarity between any two column vectors in the cluster; determining the average value of the similarity of all column vectors in each cluster as the regularization necessity of the initial covariance matrix; judging whether the initial covariance matrix is to be regularized according to the regularization necessity;
if regularization is carried out, constructing a column vector retention function according to each element in each column vector of the initial covariance matrix, and determining the maximum non-zero value corresponding to each column vector of the initial covariance matrix;
updating the initial covariance matrix according to the maximum non-zero value corresponding to each column vector to obtain a regularized and optimized covariance matrix; and performing PCA dimension reduction processing on each column vector corresponding to the talent intelligence library according to the regularized and optimized covariance matrix, and performing optimization matching processing based on the dimension reduction processing result.
Further, screening out the target elements according to the degree of dispersion of the elements in the column vector includes:
determining any element in the column vector as a pending element, calculating variances and mean values of other elements except the pending element in the column vector, and taking the ratio of the variances to the mean values as a variation coefficient of the pending element;
normalizing the variation coefficient of each element in the column vector to obtain each variation coefficient after normalization; and regarding each variation coefficient after normalization processing, taking an element corresponding to any variation coefficient in a preset variation coefficient range as a target element.
Further, the preset clustering rule includes:
for all column vectors, the column vectors with the same elements of the target dimension are divided into the same cluster, and the undivided column vectors are divided into the same cluster.
Further, determining the similarity of all column vectors in the cluster according to the modular length difference and the element distribution similarity between any two column vectors in the cluster, including:
if the elements of the target dimension of all the column vectors in the cluster are the same, determining the similarity weights corresponding to the two column vectors according to other elements except the elements of the target dimension in any two column vectors in the cluster;
if the elements of the target dimensions of all the column vectors in the cluster are different, determining the similarity weights corresponding to the two column vectors according to each element in any two column vectors in the cluster;
determining a similarity factor between two column vectors according to the similarity weights corresponding to the two column vectors and the modular length difference of the two column vectors;
obtaining all similar factors corresponding to the cluster, and calculating the accumulation sum of all the similar factors; and carrying out normalization processing on the accumulated sum of all the similarity factors, and taking the normalized numerical value as the similarity of all column vectors in the cluster.
Further, determining the similarity weights corresponding to the two column vectors includes:
if the elements of the target dimension of all the column vectors in the cluster are the same, calculating cosine similarity between two elements of the same position except the elements of the target dimension in any two column vectors in the cluster, and obtaining corresponding cosine similarity;
if the elements of the target dimensions of all the column vectors in the cluster are different, calculating cosine similarity between two elements at the same position in any two column vectors in the cluster to obtain corresponding cosine similarity;
and calculating the accumulated value of each cosine similarity, carrying out normalization processing on the accumulated value, and taking the accumulated value after normalization processing as the similarity weight corresponding to the two column vectors.
Further, determining a similarity factor between two column vectors according to the similarity weights corresponding to the two column vectors and the modular length difference of the two column vectors includes:
determining the modular lengths of the two column vectors, and then calculating the absolute value of the difference between the modular lengths of the two column vectors; the product of the inverse proportion value of that absolute value and the similarity weight is taken as the similarity factor between the two column vectors.
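As a non-limiting illustration, the similarity factor can be sketched in Python as follows. The text does not specify the form of the "inverse proportion value"; the sketch assumes 1 / (1 + d), where d is the absolute modular length difference, so that identical lengths do not cause a division by zero. The function name `similarity_factor` and the use of NumPy are likewise assumptions.

```python
import numpy as np

def similarity_factor(vec_i, vec_j, weight):
    """Similarity factor of two column vectors.

    ASSUMPTION: the "inverse proportion value" of the modular length
    difference d is taken as 1 / (1 + d); the source does not give its form.
    """
    length_gap = abs(np.linalg.norm(vec_i) - np.linalg.norm(vec_j))
    return weight / (1.0 + length_gap)

# Two vectors with modular lengths 5 and 10 and a similarity weight of 0.5.
factor = similarity_factor(np.array([3.0, 4.0]), np.array([6.0, 8.0]), 0.5)
```

Under this assumption the factor equals the similarity weight when the two modular lengths coincide and shrinks as the lengths diverge.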
Further, the calculation formula of the column vector retention function is:
[Formula image not preserved in the source text.] In the formula, $R_i$ is the retention degree of the $i$-th column vector of the initial covariance matrix, and Norm is a linear normalization function; $X_i$ is the $i$-th column vector of the initial covariance matrix, $X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n})$, where $x_{i,1}$ is the 1st element in the $i$-th column vector of the initial covariance matrix, $x_{i,2}$ is the 2nd element, $x_{i,n}$ is the $n$-th element, and $n$ is the number of elements in the $i$-th column vector; $\Delta_i$ is the increment of each element in the $i$-th column vector of the initial covariance matrix; $\cos(\cdot,\cdot)$ is the cosine similarity between two column vectors; and $\mathrm{Var}$ is the variance function.
Further, determining a maximum non-zero value for each column vector of the initial covariance matrix comprises:
in the calculation formula of the column vector retention function, the increment corresponding to the maximum retention of the ith column vector is taken as the maximum non-zero value corresponding to the ith column vector of the initial covariance matrix.
Further, updating the initial covariance matrix according to the maximum non-zero value corresponding to each column vector to obtain a regularized and optimized covariance matrix, which comprises the following steps:
for any column vector of the initial covariance matrix, adding each element in the column vector with the maximum non-zero value corresponding to the column vector to obtain a new column vector; and acquiring each new column vector, and taking a matrix formed by each new column vector as a regularized and optimized covariance matrix.
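As a non-limiting illustration, the column-wise update can be sketched in Python as follows. The function name `regularize_columns`, the example matrix and the example increments are hypothetical; how the maximum non-zero values are obtained from the retention function is not shown.

```python
import numpy as np

def regularize_columns(cov, max_nonzero_values):
    """Add to every element of column i the value associated with column i."""
    cov = np.asarray(cov, dtype=float)
    deltas = np.asarray(max_nonzero_values, dtype=float)
    if deltas.shape != (cov.shape[1],):
        raise ValueError("one value is required per column of the matrix")
    # Broadcasting a (1, m) row adds deltas[i] to each element of column i.
    return cov + deltas[np.newaxis, :]

initial = np.array([[2.0, 1.0],
                    [1.0, 2.0]])
updated = regularize_columns(initial, [0.1, 0.2])
```

Because each column receives its own increment, the updated matrix is in general no longer symmetric, so a general eigen-solver such as `np.linalg.eig` would be needed in place of one that assumes symmetry.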
Further, the data preprocessing is performed on the information data of all talent individuals to obtain each column vector corresponding to the talent intelligent library, including:
converting information data of all talents into initial column vectors by OneHot coding to obtain each initial column vector; and carrying out standardization processing on the initial column vector to obtain a standardized initial column vector, and taking the standardized initial column vector as a column vector.
The invention has the following beneficial effects:
The invention provides a large-scale talent intelligence library information optimization matching method, which is mainly applicable to the technical field of information matching. Based on the information data of all talent individuals in the talent intelligence library, the regularization necessity of the initial covariance matrix is quantified, and whether to regularize the initial covariance matrix is judged from that necessity. This helps determine whether the initial covariance matrix is singular and, using the characteristics of the scenario, overcomes the deviation in the dimension reduction result caused by a singular covariance matrix. When the regularization necessity is calculated, the column vectors corresponding to the talent intelligence library are clustered, which groups highly similar column vectors into the same cluster; calculating the similarity of all column vectors within clusters of different attributes effectively improves the reliability and accuracy of the regularization necessity value and so strengthens the reference value of the decision on whether to regularize. A column vector retention function is constructed from the elements of each column vector of the initial covariance matrix, so that regularization is carried out while preserving the characteristics of the original matrix; this allows the maximum non-zero value of each column vector to be determined adaptively, improves the robustness of that determination, and helps restore data characteristics afterwards while still ensuring dimension reduction. The regularized and optimized covariance matrix can be decomposed into complete non-zero eigenvalues and corresponding eigenvectors, so that a more accurate vector dimension reduction result is obtained and accurate optimization matching of information is achieved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for optimizing and matching information of a large-scale talent database.
Detailed Description
In order to further describe the technical means and effects adopted by the present invention to achieve the preset purpose, the following detailed description is given below of the specific implementation, structure, features and effects of the technical solution according to the present invention with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The specific scenario addressed by the invention is as follows: when PCA is used to reduce the dimension of the information data of the talent intelligence library, the high correlation among the data easily makes the covariance matrix of the dimension reduction process singular. A singular matrix cannot be decomposed into complete non-zero eigenvalues and corresponding eigenvectors, so effective dimension reduction cannot be performed; original data characteristics are lost during dimension reduction, and the accuracy of information optimization matching is low.
In order to improve the accuracy of information optimization matching, the regularization necessity of the covariance matrix is obtained from vector similarity, and the maximum non-zero value used for regularization is determined from the vector retention degree, so as to eliminate the influence of a singular matrix on PCA dimension reduction. Specifically, this embodiment provides a large-scale talent intelligence library information optimization matching method which, as shown in Fig. 1, comprises the following steps:
S1, acquiring information data of all talent individuals in the talent intelligence library, and obtaining each column vector corresponding to the talent intelligence library; and obtaining an initial covariance matrix according to each column vector.
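As a non-limiting illustration of this scenario, the following Python sketch shows how a feature that is an exact linear combination of other features drives the smallest eigenvalue of the covariance matrix to zero. NumPy is assumed; the random data and the name `smallest_eigenvalue` are hypothetical and not part of the described method.

```python
import numpy as np

def smallest_eigenvalue(data: np.ndarray) -> float:
    """Smallest eigenvalue of the feature covariance matrix.

    `data` holds one observation per row and one feature per column.
    """
    cov = np.cov(data, rowvar=False)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(cov)[0])

rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))

# Third feature is an exact linear combination of the first two,
# mimicking strongly related attributes.
correlated = np.column_stack([base, base[:, 0] + base[:, 1]])
# Third feature is drawn independently of the first two.
independent = np.column_stack([base, rng.normal(size=50)])

print(smallest_eigenvalue(correlated))   # numerically zero: singular matrix
print(smallest_eigenvalue(independent))  # clearly positive
```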
In the first step, each column vector corresponding to the talent intelligent library is obtained.
In this embodiment, information data of all talent individuals in the talent intelligence library are collected. The talent intelligence library contains information data for a large number of talent individuals, including but not limited to professional information, project experience and industry experience. Each individual has corresponding information data; that is, the information data of the talent intelligence library cover multiple dimensions, and different dimensions carry different information. To perform effective feature engineering, data preprocessing is applied to the information data of all talent individuals to obtain each column vector corresponding to the talent intelligence library. The specific steps are as follows:
converting information data of all talents into initial column vectors by OneHot coding (one-bit effective coding) to obtain each initial column vector; and carrying out standardization processing on the initial column vector to obtain a standardized initial column vector, and taking the standardized initial column vector as a column vector.
Thus, the column vectors are obtained: m vectors of n dimensions, where m is the total number of column vectors and n is the number of elements in a single column vector. For occupation information, each occupation has a fixed encoding result; for example, a teacher may be encoded as 001. The process of converting text information (such as project experience and industry experience) into vector data and the standardization process are both prior art, are outside the scope of protection of the present invention, and are not described in detail here.
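As a non-limiting illustration of the preprocessing, the following Python sketch one-hot encodes a single categorical field and standardizes the result. The helper names, the sample occupations and the choice of z-score standardization are assumptions; the source only states that OneHot coding and standardization are applied.

```python
import numpy as np

def one_hot_encode(values: list[str]) -> tuple[np.ndarray, list[str]]:
    """Return a (samples x categories) 0/1 matrix and the category order."""
    categories = sorted(set(values))
    index = {category: k for k, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1.0
    return encoded, categories

def standardize(features: np.ndarray) -> np.ndarray:
    """Z-score each feature across individuals (ASSUMED standardization)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant features at zero
    return (features - mean) / std

occupations = ["teacher", "doctor", "teacher", "engineer"]
encoded, categories = one_hot_encode(occupations)

# One column per individual: shape (n elements, m individuals).
column_vectors = standardize(encoded).T
```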
And secondly, acquiring an initial covariance matrix according to each column vector.
In this embodiment, the column vectors are used to construct the covariance matrix of the vector dimension reduction process, denoted the initial covariance matrix. Each row and column of the initial covariance matrix holds the covariance values between different vectors, that is, whether different vectors vary in the same direction; the purpose is to prepare the data for vector dimension reduction. The construction of a covariance matrix is prior art, is outside the scope of protection of the present invention, and is not described in detail here.
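As a non-limiting illustration, an initial covariance matrix can be computed with NumPy as sketched below. The sketch assumes the column vectors are stored as the columns of an (n x m) array and that the covariance is taken between those column vectors, which yields an m x m matrix; the function name `initial_covariance` and the sample values are hypothetical.

```python
import numpy as np

def initial_covariance(column_vectors: np.ndarray) -> np.ndarray:
    """Covariance between the columns of an (n x m) array.

    rowvar=False tells NumPy that each column is one variable,
    so the result has shape (m, m).
    """
    return np.cov(column_vectors, rowvar=False)

# n = 4 elements per vector, m = 3 column vectors.
vectors = np.array([[1.0, 2.0, 0.0],
                    [2.0, 4.0, 1.0],
                    [3.0, 6.0, 5.0],
                    [4.0, 8.0, 2.0]])
cov = initial_covariance(vectors)
```

In this example the second column is twice the first, so the resulting matrix is singular, which is the situation the regularization step is meant to handle.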
S2, clustering all column vectors according to each element in each column vector to obtain each cluster.
Each column vector corresponds to one talent individual. The number of individuals is large, and the data of the various dimensions are relatively similar from one individual to another, so the most obvious characteristic of the talent intelligence library data is similarity: different column vectors are highly similar to one another, which to some extent reflects the need for dimension reduction. To reduce the amount of calculation and improve the accuracy of the similarity values calculated later, all column vectors are first divided into different clusters based on their dimension characteristics. The specific implementation steps may include:
In the first step, a column vector is selected, target elements are screened out according to the degree of dispersion of the elements in the column vector, and the dimension of the target elements is determined to be the target dimension.
And a first sub-step of determining any element in the column vector as a pending element, calculating variances and mean values of other elements except the pending element in the column vector, and taking the ratio of the variances to the mean values as a variation coefficient of the pending element.
As an example, the calculation formula of the coefficient of variation of the undetermined element may be:
$c = \dfrac{\sigma^2}{\mu}$; wherein $c$ is the coefficient of variation of the undetermined element in the column vector, $\sigma^2$ is the variance of the other elements in the column vector except the undetermined element, and $\mu$ is the mean of the other elements in the column vector except the undetermined element.
A second sub-step of normalizing the variation coefficient of each element in the column vector to obtain each variation coefficient after normalization; and regarding each variation coefficient after normalization processing, taking an element corresponding to any variation coefficient in a preset variation coefficient range as a target element.
It should be noted that each element in the column vector has its own coefficient of variation. A large coefficient of variation means that, if the dimension of that element were used as the key factor of the subsequent clustering condition, the other elements would be widely dispersed and show no clustering tendency, so that dimension does not meet the basic condition for clustering.
In this embodiment, the coefficient of variation of each element in the column vector is normalized by using a linear normalization function; the preset variation coefficient range may be set to be smaller than 0.8, that is, between 0 and 0.8, and the preset variation coefficient range may be set by an implementer according to specific practical situations, and is not particularly limited.
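As a non-limiting illustration, the screening of target elements can be sketched in Python as follows. The sketch assumes population variance, min-max scaling as the linear normalization, and a non-zero mean for every leave-one-out subset; the function names and the sample vector are hypothetical.

```python
import numpy as np

def leave_one_out_cv(vector: np.ndarray) -> np.ndarray:
    """For each element, variance / mean of the remaining elements."""
    coeffs = np.empty(len(vector))
    for k in range(len(vector)):
        rest = np.delete(vector, k)
        # ASSUMPTION: the mean of the remaining elements is non-zero.
        coeffs[k] = rest.var() / rest.mean()
    return coeffs

def min_max(values: np.ndarray) -> np.ndarray:
    """Linear normalization to [0, 1]."""
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values, dtype=float)
    return (values - values.min()) / span

def target_dimensions(vector: np.ndarray, upper: float = 0.8) -> np.ndarray:
    """Indices whose normalized coefficient of variation is below `upper`."""
    normalized = min_max(leave_one_out_cv(vector))
    return np.flatnonzero(normalized < upper)

sample = np.array([1.0, 2.0, 3.0, 10.0])
print(target_dimensions(sample))  # [0 3]
```

For the sample vector, removing the outlying value 10 leaves the most tightly grouped remainder, so index 3 receives the smallest coefficient.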
And secondly, clustering all column vectors according to a preset clustering rule according to elements of a target dimension in each column vector to obtain each cluster.
In this embodiment, the preset clustering rule refers to: for all column vectors, the column vectors with the same elements of the target dimension are divided into the same cluster, and the undivided column vectors are divided into the same cluster.
Clustering all column vectors to obtain each cluster, wherein the specific implementation steps comprise:
assuming that the target dimension is occupation, dividing column vectors with the same occupation into the same cluster, for example, clustering column vectors with the same occupation as teachers into one cluster, or clustering column vectors with the same occupation as doctors into one cluster, wherein each cluster corresponding to the column vectors with the same target dimension element can be obtained, and the number of the column vectors in the cluster is not less than 2; then, it is determined whether there are remaining column vectors, that is, column vectors not divided into clusters, which are column vectors having different target dimension elements, and if there are remaining column vectors, all column vectors not divided into clusters are clustered into one cluster. Thus, the present embodiment obtains each cluster corresponding to all column vectors.
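As a non-limiting illustration, the preset clustering rule can be sketched in Python as follows. The function name `cluster_by_target`, the tuple representation of a column vector and the sample data are hypothetical; exact equality is assumed when comparing target-dimension elements.

```python
from collections import defaultdict

def cluster_by_target(column_vectors, target_dim):
    """Group vector indices by the element found at `target_dim`.

    Groups with at least two members each form a cluster; all vectors
    left without a partner are gathered into one final cluster.
    """
    groups = defaultdict(list)
    for idx, vector in enumerate(column_vectors):
        groups[vector[target_dim]].append(idx)

    clusters = [members for members in groups.values() if len(members) >= 2]
    leftovers = [members[0] for members in groups.values() if len(members) == 1]
    if leftovers:
        clusters.append(leftovers)
    return clusters

people = [("teacher", 3), ("doctor", 5), ("teacher", 8),
          ("nurse", 2), ("doctor", 1), ("pilot", 4)]
print(cluster_by_target(people, target_dim=0))  # [[0, 2], [1, 4], [3, 5]]
```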
S3, determining the regularization necessity of the initial covariance matrix according to the modular length difference and the element distribution similarity between any two column vectors in each cluster, and further judging whether the initial covariance matrix is subjected to regularization treatment or not.
In order to avoid the influence that the initial covariance matrix cannot completely acquire the corresponding eigenvalue and eigenvector due to vector similarity, and the final dimension reduction result loses the original data characteristics, the initial covariance matrix in the dimension reduction process needs to be optimized. The higher the similarity between the vectors, the higher the regularization necessity of the initial covariance matrix, and the higher the possibility that the initial covariance matrix is regularized, so that by judging the similarity of the vectors, whether the initial covariance matrix needs to be regularized can be obtained. After the singular matrix is regularized, the eigenvalue and the corresponding eigenvector can be completely calculated through eigenvalue decomposition, so that the dimension reduction of the accurate vector is realized.
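As a non-limiting illustration, the decision can be sketched in Python as follows: the regularization necessity is taken as the mean of the per-cluster similarities and compared with a threshold. The threshold value 0.5 and both function names are hypothetical; no threshold is specified in this part of the text.

```python
from statistics import mean

def regularization_necessity(cluster_similarities):
    """Mean of the similarity values obtained for the individual clusters."""
    return mean(cluster_similarities)

def needs_regularization(cluster_similarities, threshold=0.5):
    """ASSUMPTION: regularize when the necessity exceeds `threshold`."""
    return regularization_necessity(cluster_similarities) > threshold

decision = needs_regularization([0.9, 0.7, 0.8])
```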
First, for any cluster, determining the similarity of all column vectors in the cluster according to the modular length difference and element distribution similarity between any two column vectors in the cluster.
Taking any cluster as an example to calculate and describe the similarity, the specific implementation steps can include:
the first substep, determining the similarity weight corresponding to any two column vectors in the cluster.
It should be noted that, from the principle and steps by which the column vectors are acquired, the greater the similarity or association between each pair of elements at the same coordinate position in two column vectors, the greater the weight of the two vectors in the subsequent similarity calculation for the cluster, that is, the greater their contribution when the similarity factor between the two column vectors is calculated.
In this embodiment, based on the clustering property of the cluster, the similarity weights corresponding to any two column vectors in the cluster are determined, that is, the cluster properties of the cluster are different, and the calculation process of the corresponding similarity weights is different.
Firstly, if the elements of the target dimension of all the column vectors in the cluster are the same, determining the similarity weights corresponding to the two column vectors according to other elements except the elements of the target dimension in any two column vectors in the cluster.
Calculating cosine similarity between two elements at the same position except for elements of a target dimension in any two column vectors in the cluster, and obtaining corresponding cosine similarity; and calculating the accumulated value of each cosine similarity, carrying out normalization processing on the accumulated value, and taking the accumulated value after normalization processing as the similarity weight corresponding to the two column vectors.
As an example, the calculation formula of the similarity weights corresponding to the two column vectors may be:
$$w^{(1)}_{i,j} = \mathrm{Norm}\left(\sum_{h=1}^{n-1} \cos\left(x_{i,h},\, x_{j,h}\right)\right)$$
where $w^{(1)}_{i,j}$ is the similarity weight corresponding to the i-th column vector and the j-th column vector in the cluster when the elements of the target dimension of all column vectors in the cluster are the same; Norm is a linear normalization function; n is the number of elements in a column vector; cos is the cosine similarity function; $x_{i,h}$ is the h-th element, other than the element of the target dimension, in the i-th column vector in the cluster; and $x_{j,h}$ is the h-th element, other than the element of the target dimension, in the j-th column vector in the cluster.
In this formula, which applies when the elements of the target dimension of all column vectors in the cluster are the same, if the elements of two column vectors other than the elements of the target dimension are still highly similar, the matching results of the two column vectors are also highly similar. In other words, the two individuals corresponding to the two column vectors have similar data apart from the element of the target dimension, so the pair contributes more to the overall similarity calculation.
Secondly, if the elements of the target dimension of the column vectors in the cluster are different, the similarity weight corresponding to any two column vectors in the cluster is determined according to every element of the two column vectors.
The cosine similarity between each pair of elements at the same position in the two column vectors is calculated; the cosine similarities are then accumulated, the accumulated value is normalized, and the normalized accumulated value is taken as the similarity weight corresponding to the two column vectors.
As an example, the calculation formula of the similarity weights corresponding to the two column vectors may be:
$$w^{(2)}_{i,j} = \mathrm{Norm}\left(\sum_{h=1}^{n} \cos\left(x_{i,h},\, x_{j,h}\right)\right)$$
where $w^{(2)}_{i,j}$ is the similarity weight corresponding to the i-th column vector and the j-th column vector in the cluster when the elements of the target dimension of the column vectors in the cluster are different; Norm is a linear normalization function; n is the number of elements in a column vector; cos is the cosine similarity function; $x_{i,h}$ is the h-th element in the i-th column vector in the cluster; and $x_{j,h}$ is the h-th element in the j-th column vector in the cluster.
Following the calculation process described above, the similarity weight is calculated for every pair of column vectors in a cluster, so that all the similarity weights in each cluster are obtained.
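The weight calculation described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation. It assumes that the linear normalization is a min-max normalization over all pairs in the cluster, and that the cosine similarity of two scalar elements is computed by treating each as a one-dimensional vector (so it reduces to +1, -1, or 0). The function names and the `target_dim` and `same_target` parameters are hypothetical.

```python
import itertools
import numpy as np

def elementwise_cosine(a, b):
    """Cosine similarity of two scalars treated as 1-D vectors (assumption)."""
    denom = abs(a) * abs(b)
    return 0.0 if denom == 0 else (a * b) / denom

def raw_pair_score(u, v, skip_dim=None):
    """Accumulate element-wise cosine similarities, optionally skipping one dimension."""
    total = 0.0
    for h, (a, b) in enumerate(zip(u, v)):
        if h == skip_dim:
            continue  # the target-dimension element is excluded in the first case
        total += elementwise_cosine(a, b)
    return total

def similarity_weights(cluster, target_dim, same_target):
    """Return {(i, j): weight} for every pair of column vectors in one cluster.

    cluster: 2-D array, one column vector per row (layout chosen for this sketch).
    same_target: True if all vectors share the same target-dimension element.
    """
    skip = target_dim if same_target else None
    pairs = list(itertools.combinations(range(len(cluster)), 2))
    if not pairs:
        return {}
    raw = np.array([raw_pair_score(cluster[i], cluster[j], skip) for i, j in pairs])
    span = raw.max() - raw.min()
    # Min-max normalization over the pairs of this cluster (assumed meaning of Norm).
    norm = np.ones_like(raw) if span == 0 else (raw - raw.min()) / span
    return dict(zip(pairs, norm))
```

For example, with three vectors that share the same first element, the pair whose remaining elements agree in sign receives the largest weight after normalization.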
In the second sub-step, the similarity factor between the two column vectors is determined according to the similarity weight corresponding to the two column vectors and the modular length difference of the two column vectors.
The modular lengths of the two column vectors are determined, and the absolute value of the difference between them is calculated; the product of the inverse proportion value of this absolute value and the similarity weight is taken as the similarity factor between the two column vectors.
In this embodiment, each cluster has its own similarity. Dividing the column vectors into different clusters before calculating the similarity effectively avoids the case in which two column vectors with different target-dimension elements are assigned a high similarity merely because their modular length difference is small; it also reduces the number of similarity factors that must be calculated and improves calculation efficiency.
As an example, the calculation formula of the similarity factor between two column vectors may be:
$$F_{i,j} = w_{i,j} \times \frac{1}{\left|L_i - L_j\right|}$$
where $F_{i,j}$ is the similarity factor between the i-th column vector and the j-th column vector in the cluster; $w_{i,j}$ is the similarity weight corresponding to the i-th column vector and the j-th column vector in the cluster; $L_i$ is the modular length of the i-th column vector in the cluster; $L_j$ is the modular length of the j-th column vector in the cluster; $\left|L_i - L_j\right|$ is the absolute value of $L_i - L_j$; and $\frac{1}{\left|L_i - L_j\right|}$ is the inverse proportion value of $\left|L_i - L_j\right|$.
In the calculation formula of the similarity factor, the similarity factor $F_{i,j}$ represents the degree of similarity between two vectors. The similarity weight $w_{i,j}$ is positively correlated with the similarity factor $F_{i,j}$: the larger the similarity weight, the more similar the elements of the same dimension in the two column vectors, and the more similar the two vectors. $\left|L_i - L_j\right|$ represents the degree of difference between the modular lengths of the two column vectors; the larger this difference, the less similar the two vectors, so $\left|L_i - L_j\right|$ is negatively correlated with the similarity factor $F_{i,j}$: the smaller $\left|L_i - L_j\right|$, the larger the similarity factor.
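The similarity factor can be illustrated with a short Python sketch. It assumes that the modular length is the Euclidean norm and that the inverse proportion value is a plain reciprocal; the small constant `eps`, which guards against division by zero when two vectors have equal modular lengths, is an assumption of this sketch and is not stated in the source.

```python
import numpy as np

def similarity_factor(u, v, weight, eps=1e-6):
    """Similarity factor of two column vectors.

    weight: similarity weight of the pair (already normalized).
    eps: hypothetical guard against a zero modular length difference.
    """
    length_diff = abs(np.linalg.norm(u) - np.linalg.norm(v))
    # Larger weight -> larger factor; larger length difference -> smaller factor.
    return weight / (length_diff + eps)
```

For instance, two vectors with modular lengths 5 and 10 and a similarity weight of 0.5 give a factor of about 0.1.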
In the third sub-step, the similarity of all column vectors in the cluster is determined according to all the similarity factors corresponding to the cluster.
All the similarity factors corresponding to the cluster are obtained and their accumulated sum is calculated; the accumulated sum is normalized, and the normalized value is taken as the similarity of all column vectors in the cluster.
In this embodiment, any two column vectors in the cluster form a pair. By following the first and second sub-steps of this step, the similarity factor of each pair of column vectors in the cluster, that is, all the similarity factors corresponding to the cluster, can be obtained. To quantify the degree of similarity of all column vectors in the cluster, all the similarity factors are accumulated, and the accumulated sum is normalized with a linear normalization function to facilitate subsequent numerical calculation. The linear normalization function is prior art and is not described in detail here.
Second, the average value of the similarities of all column vectors in the clusters is determined as the regularization necessity of the initial covariance matrix, and whether the initial covariance matrix is regularized is judged according to the regularization necessity.
In the first sub-step, the average value of the similarity of all column vectors in each cluster is determined as the regularization necessity of the initial covariance matrix.
In this embodiment, the similarity of all column vectors in each cluster can be calculated by following the first step of step S3 described above. Calculating the similarity separately for clusters whose target-dimension elements are the same quantifies the similarity among all talent individuals in the talent intelligence library more accurately and thus improves the numerical accuracy of the regularization necessity of the initial covariance matrix.
As an example, the calculation formula of the regularization necessity of the initial covariance matrix may be:
$$R = \frac{1}{P}\sum_{p=1}^{P} S\left(C_p\right)$$
where $R$ is the regularization necessity of the initial covariance matrix; P is the number of clusters; p is the serial number of a cluster; $S\left(C_p\right)$ is the similarity of all column vectors in the p-th cluster; and $C_p$ is the p-th cluster.
In the calculation formula of the regularization necessity, the necessity of regularizing the initial covariance matrix is quantified by the similarity of all column vectors in each cluster; the average of these similarities over all clusters measures the overall similarity of the column vector data. The overall similarity is positively correlated with the regularization necessity: the larger the overall similarity, the greater the necessity of regularizing the initial covariance matrix. The overall similarity ranges from 0 to 1, so the regularization necessity also ranges from 0 to 1.
In the second sub-step, whether the initial covariance matrix is regularized is judged according to the regularization necessity.
In this embodiment, the decision threshold $\tau$ is set to 0.8. When the regularization necessity is greater than the decision threshold $\tau$, it is judged that the initial covariance matrix needs to be regularized, and the maximum non-zero value is determined to optimize the initial covariance matrix; otherwise, it is judged that the initial covariance matrix does not need to be regularized, and ordinary PCA dimension reduction can be performed subsequently. The decision threshold may be set by the practitioner according to the actual situation and is not specifically limited here.
The regularization necessity obtained from the similarity between the data in the talent intelligence library helps to judge whether the initial covariance matrix is a singular matrix; if it is, regularization is needed. Existing approaches apply PCA dimension reduction directly and do not consider that the information data in the talent intelligence library may be highly correlated. By contrast, this embodiment effectively avoids numerical instability or errors when the eigenvectors are calculated, improves the reference value of the dimension reduction result, and further improves the accuracy of information optimization matching.
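The averaging and threshold decision described in this step can be sketched as follows. The sketch assumes that the per-cluster similarities have already been normalized to the range 0 to 1; the function names are hypothetical, and the default threshold of 0.8 is the example value given in this embodiment.

```python
import numpy as np

def regularization_necessity(cluster_similarities):
    """Mean of the per-cluster similarities (each assumed to lie in [0, 1])."""
    sims = np.asarray(cluster_similarities, dtype=float)
    if sims.size == 0:
        raise ValueError("at least one cluster similarity is required")
    if np.any(sims < 0) or np.any(sims > 1):
        raise ValueError("cluster similarities must be normalized to [0, 1]")
    return float(sims.mean())

def needs_regularization(necessity, threshold=0.8):
    """Regularize only when the necessity exceeds the decision threshold."""
    return necessity > threshold
```

With cluster similarities of 0.9, 0.85, and 0.95 the necessity is 0.9, which exceeds 0.8, so the initial covariance matrix would be regularized; with 0.5 and 0.7 it would not.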
And S4, if regularization processing is carried out, constructing a column vector retention function according to each element in each column vector of the initial covariance matrix, and determining the maximum non-zero value corresponding to each column vector of the initial covariance matrix.
It should be noted that the higher the regularization necessity, the closer the element values within the initial covariance matrix are to one another, and the larger the change applied to the diagonal elements of the initial covariance matrix during regularization, that is, the larger the maximum non-zero value. The maximum non-zero value is determined in order to optimize the initial covariance matrix so that the eigenvalues and eigenvectors of the matrix can be obtained as completely as possible.
In this embodiment, the column vectors differ considerably after regularization. To obtain the eigenvalues and eigenvectors completely while preserving the attributes of the column vectors, the maximum non-zero value is obtained according to the column vector retention; that is, the maximum non-zero value for regularization is obtained on the condition that the vector retention is satisfied. The larger the maximum non-zero value used in regularization, the lower the retention of the vector attributes, and the column vector retention function is constructed on this basis.
After regularization changes the values of a column vector, the more similar the increments of the elements in the same column vector, the better the attributes of the original column vector are represented; therefore, when the column vector retention function is constructed, every element in the same column vector receives the same increment. Because the column vector with increased elements has the same variation characteristics as the original column vector, the retention can be characterized by the variance of the vector after the increase: the smaller the variance, the closer the vector is to the variation of the original column vector, the smaller the influence on the original column vector, and the greater the retention. The smaller the element increment after regularization, the closer the column vector is to the attributes of the original column vector. The retention of a column vector after regularization is quantified by the cosine similarity between the vectors before and after the increase.
As an example, the calculation formula of the column vector retention function may be:
$$B_i = \mathrm{Norm}\left(\frac{\cos\left(\mathbf{a}_i,\ \mathbf{a}_i + \delta_i\right)}{\delta_i \times \mathrm{Var}\left(\mathbf{a}_i + \delta_i\right)}\right)$$
where $B_i$ is the retention of the i-th column vector of the initial covariance matrix, and Norm is a linear normalization function; $\mathbf{a}_i$ is the i-th column vector of the initial covariance matrix, $\mathbf{a}_i = \left(a_{i,1}, a_{i,2}, \ldots, a_{i,n}\right)$, where $a_{i,1}$ is the 1st element in the i-th column vector of the initial covariance matrix, $a_{i,2}$ is the 2nd element, $a_{i,n}$ is the n-th element, and n is the number of elements in the i-th column vector of the initial covariance matrix; $\delta_i$ is the increment of each element in the i-th column vector of the initial covariance matrix; $\cos\left(\mathbf{a}_i, \mathbf{a}_i + \delta_i\right)$ is the cosine similarity between the column vector $\mathbf{a}_i$ and the column vector $\mathbf{a}_i + \delta_i$; and Var is the variance function.
In the calculation formula of the column vector retention function, $B_i$ represents the degree to which the original vector feature information of the i-th column vector of the initial covariance matrix is retained after regularization. To ensure that the features of the original column vectors are not lost, for example that the industry experience and project experience of a talent individual do not become ambiguous, regularization must be carried out on the basis of preserving the features of the original column vectors. When a column vector is regularized, the corresponding elements of the initial covariance matrix change, that is, the corresponding elements in the column vector are increased, and every element in the same column vector receives the same increment. $\mathbf{a}_i + \delta_i$ is the regularized vector of the i-th column vector of the initial covariance matrix. The larger the regularization increment $\delta_i$, the smaller the retention of the i-th column vector; the smaller $\mathrm{Var}\left(\mathbf{a}_i + \delta_i\right)$, the greater the retention of the i-th column vector; and the larger $\cos\left(\mathbf{a}_i, \mathbf{a}_i + \delta_i\right)$, the greater the retention of the i-th column vector.
It should be noted that the maximum non-zero value corresponding to each column vector of the initial covariance matrix may be determined from the calculation formula of the column vector retention function. In the calculation formula, $\delta_i$ is the independent variable and $B_i$ is the dependent variable; the larger $B_i$, the greater the retention of the i-th column vector. Therefore, the increment corresponding to the maximum retention of the i-th column vector can be used as the maximum non-zero value corresponding to the i-th column vector of the initial covariance matrix. The maximum non-zero value is a key index for matrix regularization; determining it adaptively from the actual condition of the initial covariance matrix makes it more reliable and facilitates accurate regularization.
Thus, based on each element in each column vector of the initial covariance matrix, the maximum non-zero value corresponding to each column vector of the initial covariance matrix can be obtained through a column vector retention function.
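One way to picture the selection of the maximum non-zero value is a search over candidate increments. The sketch below is an illustration under stated assumptions, not the method as claimed: it assumes the retention takes the form of the cosine similarity divided by the product of the increment and the variance of the incremented vector, that Norm is a min-max normalization over the candidates, and that the candidate grid (a hypothetical parameter) contains only positive values. The guard constant `eps` is also an assumption.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def raw_retention(column, delta, eps=1e-12):
    """Assumed retention form: cos(a, a + delta) / (delta * Var(a + delta))."""
    shifted = column + delta  # every element receives the same increment
    return cosine(column, shifted) / (delta * (np.var(shifted) + eps))

def max_nonzero_value(column, candidates):
    """Return the candidate increment with the highest normalized retention."""
    column = np.asarray(column, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    raw = np.array([raw_retention(column, d) for d in candidates])
    span = raw.max() - raw.min()
    # Min-max normalization over the candidate increments (assumed meaning of Norm).
    retention = np.ones_like(raw) if span == 0 else (raw - raw.min()) / span
    return float(candidates[np.argmax(retention)]), retention
```

Under this assumed form the retention decreases as the increment grows, so the result depends on the lower bound of the candidate grid; how that bound is chosen is not specified in the source.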
S5, acquiring a regularized and optimized covariance matrix; performing PCA dimension reduction processing on each column vector corresponding to the talent intelligence library according to the regularized and optimized covariance matrix, and performing optimization matching processing based on the dimension reduction processing result.
First, the regularized and optimized covariance matrix is obtained.
So that the covariance matrix can be decomposed into a complete set of non-zero eigenvalues and corresponding eigenvectors, the initial covariance matrix is updated according to the maximum non-zero value corresponding to each column vector to obtain the regularized and optimized covariance matrix. The specific implementation steps may include:
for any column vector of the initial covariance matrix, adding each element in the column vector with the maximum non-zero value corresponding to the column vector to obtain a new column vector; and acquiring each new column vector, and taking a matrix formed by each new column vector as a regularized and optimized covariance matrix.
In this embodiment, the maximum non-zero value corresponding to each column vector is assigned to each corresponding element in the initial covariance matrix. For example, if the maximum non-zero value corresponding to the i-th column vector is $\delta_i^{\max}$, then every element in the i-th column vector corresponds to the maximum non-zero value $\delta_i^{\max}$; the value of each element in the i-th column vector is added to $\delta_i^{\max}$ to obtain a new element value, and the i-th column vector formed by the new element values is taken as the i-th new column vector. The corresponding maximum non-zero value is added to the values of all elements in each column vector of the initial covariance matrix, and the resulting matrix is taken as the regularized and optimized covariance matrix.
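The column-wise update described above can be written compactly with NumPy broadcasting. This is a minimal sketch; the function name and argument names are hypothetical, and the maximum non-zero values are assumed to have been determined beforehand, one per column.

```python
import numpy as np

def regularize_covariance(cov, max_nonzero_values):
    """Add each column's maximum non-zero value to every element of that column."""
    cov = np.asarray(cov, dtype=float)
    deltas = np.asarray(max_nonzero_values, dtype=float)
    if deltas.shape != (cov.shape[1],):
        raise ValueError("one maximum non-zero value is required per column")
    # Broadcasting a row vector adds deltas[i] to every element of column i.
    return cov + deltas[np.newaxis, :]
```

For example, applying the values 0.1 and 0.2 to the singular matrix [[1, 2], [2, 4]] gives [[1.1, 2.2], [2.1, 4.2]].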
Second, PCA dimension reduction processing is performed on each column vector corresponding to the talent intelligence library according to the regularized and optimized covariance matrix, and optimization matching processing is performed based on the dimension reduction processing result.
In this embodiment, after the regularized and optimized covariance matrix is obtained, its eigenvalues and eigenvectors are calculated, and the principal components are selected to construct a projection matrix, thereby realizing dimension reduction of each column vector corresponding to the talent intelligence library. Performing the subsequent PCA dimension reduction steps on the basis of the regularized and optimized covariance matrix yields a dimension reduction result with good accuracy. Based on the dimension reduction result, a neighbor search algorithm (k-nearest neighbors) can be used to realize the optimized matching of the information. Other means of implementing information optimization matching also exist and are not described in detail here. The implementation of PCA dimension reduction and of the neighbor search algorithm is prior art, falls outside the scope of protection of the present invention, and is not described in detail here.
The PCA algorithm, which gives a good dimension reduction result, is selected for dimension reduction; it effectively reduces the dimension of the data, removes redundant information, and retains the main characteristics of the data. When the data of the talent intelligence library is optimally matched, the similarity between vectors is high, so the advantages of the algorithm can be fully exploited in the dimension reduction process and a good dimension reduction effect is achieved.
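The dimension reduction and matching steps can be sketched with NumPy alone. This is a generic illustration of PCA projection followed by a brute-force nearest-neighbor search, not the specific implementation of this embodiment. It assumes that each row of `data` holds the column vector of one talent individual, and it symmetrizes the supplied covariance matrix before the eigendecomposition so that `numpy.linalg.eigh` applies; both choices, as well as the function names, are assumptions of the sketch.

```python
import numpy as np

def pca_project(data, cov, n_components):
    """Project centered data onto the leading eigenvectors of a covariance matrix."""
    cov = np.asarray(cov, dtype=float)
    sym = (cov + cov.T) / 2.0  # assumption: enforce symmetry before eigh
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    projection = eigvecs[:, order]
    centered = data - data.mean(axis=0)
    return centered @ projection

def nearest_neighbors(reduced, query_index, k):
    """Indices of the k rows closest (Euclidean distance) to the query row."""
    dists = np.linalg.norm(reduced - reduced[query_index], axis=1)
    dists[query_index] = np.inf  # exclude the query itself
    return np.argsort(dists)[:k]
```

A typical use would compute `reduced = pca_project(data, regularized_cov, n_components)` and then call `nearest_neighbors(reduced, query_index, k)` to retrieve the closest matches for one individual.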
The invention provides a large-scale talent intelligence library information optimization matching method, which adds the maximum non-zero value to the initial covariance matrix through regularization, eliminates the destabilizing influence of a singular matrix on PCA dimension reduction while ensuring that the vector features are not affected, improves the reliability of the dimension reduction result, and further improves the accuracy of information optimization matching.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention and are intended to be included within the scope of the invention.

Claims (10)

1. The large-scale talent intelligent library information optimization matching method is characterized by comprising the following steps of:
acquiring information data of all talents in a talent intelligent library, and performing data preprocessing on the information data of all talents to acquire each column vector corresponding to the talent intelligent library; acquiring an initial covariance matrix according to each column vector;
selecting a column vector, screening out target elements according to the degree of dispersion of each element in the column vector, and determining the dimension of the target elements as a target dimension; clustering all column vectors according to a preset clustering rule and the elements of the target dimension in each column vector to obtain each cluster;
for any cluster, determining the similarity of all column vectors in the cluster according to the modular length difference and element distribution similarity between any two column vectors in the cluster; determining the average value of the similarity of all column vectors in each cluster as the regularization necessity of an initial covariance matrix; judging whether the initial covariance matrix is subjected to regularization treatment or not according to the regularization necessity;
if regularization is carried out, constructing a column vector retention function according to each element in each column vector of the initial covariance matrix, and determining the maximum non-zero value corresponding to each column vector of the initial covariance matrix;
updating the initial covariance matrix according to the maximum non-zero value corresponding to each column vector to obtain a regularized and optimized covariance matrix; and performing PCA dimension reduction processing on each column vector corresponding to the talent intelligent library according to the regularized and optimized covariance matrix, and performing optimization matching processing based on the dimension reduction processing result.
2. The method for optimizing and matching information in a large-scale talent library according to claim 1, wherein screening out the target element according to the degree of dispersion of each element in the column vector comprises:
determining any element in the column vector as a pending element, calculating variances and mean values of other elements except the pending element in the column vector, and taking the ratio of the variances to the mean values as a variation coefficient of the pending element;
normalizing the variation coefficient of each element in the column vector to obtain each variation coefficient after normalization; and regarding each variation coefficient after normalization processing, taking an element corresponding to any variation coefficient in a preset variation coefficient range as a target element.
3. The method for optimizing and matching information of a large-scale talent library according to claim 1, wherein said preset clustering rule comprises:
for all column vectors, the column vectors with the same elements of the target dimension are divided into the same cluster, and the undivided column vectors are divided into the same cluster.
4. The method for optimizing and matching information of a large-scale talent library according to claim 3, wherein determining the similarity of all column vectors in a cluster according to the modular length difference and element distribution similarity between any two column vectors in the cluster comprises:
if the elements of the target dimension of all the column vectors in the cluster are the same, determining the similarity weights corresponding to the two column vectors according to other elements except the elements of the target dimension in any two column vectors in the cluster;
if the elements of the target dimensions of all the column vectors in the cluster are different, determining the similarity weights corresponding to the two column vectors according to each element in any two column vectors in the cluster;
determining a similarity factor between two column vectors according to the similarity weights corresponding to the two column vectors and the modular length difference of the two column vectors;
obtaining all similar factors corresponding to the cluster, and calculating the accumulation sum of all the similar factors; and carrying out normalization processing on the accumulated sum of all the similarity factors, and taking the normalized numerical value as the similarity of all column vectors in the cluster.
5. The method for optimizing matching of information in a large-scale talent library according to claim 4, wherein determining the similarity weights corresponding to the two column vectors comprises:
if the elements of the target dimension of all the column vectors in the cluster are the same, calculating cosine similarity between two elements of the same position except the elements of the target dimension in any two column vectors in the cluster, and obtaining corresponding cosine similarity;
if the elements of the target dimensions of all the column vectors in the cluster are different, calculating cosine similarity between two elements at the same position in any two column vectors in the cluster to obtain corresponding cosine similarity;
and calculating the accumulated value of each cosine similarity, carrying out normalization processing on the accumulated value, and taking the accumulated value after normalization processing as the similarity weight corresponding to the two column vectors.
6. The method for optimizing and matching information in a large-scale talent library according to claim 4, wherein determining a similarity factor between two column vectors based on the similarity weights corresponding to the two column vectors and the modular length difference of the two column vectors comprises:
determining the modular length of the two column vectors, and further calculating the absolute value of the difference value of the modular lengths of the two column vectors; the product of the inverse proportion value of the absolute value of the difference between the modular lengths of the two column vectors and the similarity weight is taken as a similarity factor between the two column vectors.
7. The method for optimizing and matching information of a large-scale talent library according to claim 1, wherein the calculation formula of the column vector retention function is:
$$B_i = \mathrm{Norm}\left(\frac{\cos\left(\mathbf{a}_i,\ \mathbf{a}_i + \delta_i\right)}{\delta_i \times \mathrm{Var}\left(\mathbf{a}_i + \delta_i\right)}\right)$$
where $B_i$ is the retention of the i-th column vector of the initial covariance matrix, and Norm is a linear normalization function; $\mathbf{a}_i$ is the i-th column vector of the initial covariance matrix, $\mathbf{a}_i = \left(a_{i,1}, a_{i,2}, \ldots, a_{i,n}\right)$, where $a_{i,1}$ is the 1st element in the i-th column vector of the initial covariance matrix, $a_{i,2}$ is the 2nd element, $a_{i,n}$ is the n-th element, and n is the number of elements in the i-th column vector of the initial covariance matrix; $\delta_i$ is the increment of each element in the i-th column vector of the initial covariance matrix; $\cos\left(\mathbf{a}_i, \mathbf{a}_i + \delta_i\right)$ is the cosine similarity between the column vector $\mathbf{a}_i$ and the column vector $\mathbf{a}_i + \delta_i$; and Var is the variance function.
8. The method for optimized matching of information in a large-scale talent library as claimed in claim 7, wherein determining the maximum non-zero value for each column vector of the initial covariance matrix comprises:
in the calculation formula of the column vector retention function, the increment corresponding to the maximum retention of the ith column vector is taken as the maximum non-zero value corresponding to the ith column vector of the initial covariance matrix.
9. The method for optimizing and matching information of a large-scale talent intelligent library according to claim 1, wherein updating an initial covariance matrix according to a maximum non-zero value corresponding to each column vector to obtain a regularized and optimized covariance matrix comprises:
for any column vector of the initial covariance matrix, adding each element in the column vector with the maximum non-zero value corresponding to the column vector to obtain a new column vector; and acquiring each new column vector, and taking a matrix formed by each new column vector as a regularized and optimized covariance matrix.
10. The method for optimizing and matching information of a large-scale talent database according to claim 1, wherein the step of performing data preprocessing on information data of all talent individuals to obtain respective column vectors corresponding to the talent database comprises the steps of:
converting information data of all talents into initial column vectors by OneHot coding to obtain each initial column vector; and carrying out standardization processing on the initial column vector to obtain a standardized initial column vector, and taking the standardized initial column vector as a column vector.
CN202311706817.6A 2023-12-13 2023-12-13 Large-scale talent intelligence library information optimization matching method Active CN117390297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311706817.6A CN117390297B (en) 2023-12-13 2023-12-13 Large-scale talent intelligence library information optimization matching method

Publications (2)

Publication Number Publication Date
CN117390297A true CN117390297A (en) 2024-01-12
CN117390297B CN117390297B (en) 2024-02-27

Family

ID=89463515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311706817.6A Active CN117390297B (en) 2023-12-13 2023-12-13 Large-scale talent intelligence library information optimization matching method

Country Status (1)

Country Link
CN (1) CN117390297B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449098A (en) * 2020-03-25 2021-09-28 中移(上海)信息通信科技有限公司 Log clustering method, device, equipment and storage medium
CN114091904A (en) * 2021-11-22 2022-02-25 中交第一公路勘察设计研究院有限公司 Enterprise migration park recruitment analysis method based on artificial intelligence algorithm
CN115329895A (en) * 2022-09-06 2022-11-11 南昌大学 Multi-source heterogeneous data noise reduction analysis processing method
CN115456367A (en) * 2022-05-21 2022-12-09 武汉研数聚英网络科技有限公司 Talent data competitiveness matching method for multi-source city data
CN116705337A (en) * 2023-08-07 2023-09-05 山东第一医科大学第一附属医院(山东省千佛山医院) Health data acquisition and intelligent analysis method
CN116701725A (en) * 2023-08-09 2023-09-05 匠达(苏州)科技有限公司 Engineer personnel data portrait processing method based on deep learning
CN116739541A (en) * 2023-08-15 2023-09-12 湖南立人科技有限公司 Intelligent talent matching method and system based on AI technology

Also Published As

Publication number Publication date
CN117390297B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN112289391A (en) Anode aluminum foil performance prediction system based on machine learning
CN111026741A (en) Data cleaning method and device based on time series similarity
Tan et al. SRAGL-AWCL: A two-step multi-view clustering via sparse representation and adaptive weighted cooperative learning
CN111612319A (en) Load curve depth embedding clustering method based on one-dimensional convolution self-encoder
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
CN113780343A (en) Bilateral slope DTW distance load spectrum clustering method based on LTTB dimension reduction
CN114093445B (en) Patient screening marking method based on partial multi-marking learning
CN117390297B (en) Large-scale talent intelligence library information optimization matching method
Xie et al. The fast clustering algorithm for the big data based on K-means
CN112154453A (en) Apparatus and method for clustering input data
CN114140657A (en) Image retrieval method based on multi-feature fusion
US20230259818A1 (en) Learning device, feature calculation program generation method and similarity calculator
CN104616027A (en) Non-adjacent graph structure sparse face recognizing method
CN112215490A (en) Power load cluster analysis method based on correlation coefficient improved K-means
CN111091243A (en) PCA-GM-based power load prediction method, system, computer-readable storage medium, and computing device
Ding et al. Time-varying Gaussian Markov random fields learning for multivariate time series clustering
CN116401528A (en) Multi-element time sequence unsupervised dimension reduction method based on global-local divergence
Tan et al. Feature fusion multi-view hashing based on random kernel canonical correlation analysis
CN114528917A (en) Dictionary learning algorithm based on SPD data of Riemannian manifold tangent space and local homeomorphism
CN114492786A (en) Visual transform pruning method for alternative direction multipliers
Ahn et al. Clustering algorithm for time series with similar shapes
Tani et al. A new algorithm for medical images indexing based on wavelet transform and principal component analysis
Gong et al. Visual Clustering Analysis of Electricity Data Based on t-SNE
CN110210003A (en) One kind being based on symbol entropy of transition data statistical analysis method
Kumar et al. A novel concept specific deep learning for disease treatment prediction on patient trajectory data model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant