CN119441250A - Method and device for population diffusion based on seed population - Google Patents
- Publication number: CN119441250A
- Application number: CN202411457046.6A
- Authority: CN (China)
- Prior art keywords: user, seed, samples, crowd, compression
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of this specification provide a method for population diffusion based on a seed population. Feature compression is performed on the characterization vectors of the user samples in a crowd data set to obtain a compression code for each user sample, where the crowd data set includes a seed user set formed by seed user samples belonging to the target crowd. The crowd data set is divided into a plurality of user groups according to the compression codes. A discrimination index with respect to the seed user set is determined for each user group. A number of user groups are then selected according to their discrimination indexes, and the user samples in the selected groups are classified into the target crowd.
Description
Technical Field
One or more embodiments of the present disclosure relate to the field of user big data mining and analysis, and in particular, to a method and apparatus for population diffusion based on seed population.
Background
Population diffusion (crowd spreading) is a common technique for user mining and crowd segmentation. Starting from an existing seed population, it aims to discover and screen, from a crowd data set containing massive numbers of user samples, a larger target crowd similar to the seed population. The target crowd may be a crowd with higher conversion potential in marketing activities, or a crowd with certain characteristics that requires special business handling. Finding the target crowd accurately and quickly makes it easier to provide suitable services or targeted business processing, which improves both user experience and business results.
A scheme is therefore desired that can efficiently analyze a seed population within a crowd data set containing a large number of user samples, find a target crowd similar to the seed population, and improve the efficiency of population diffusion.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method and apparatus for population diffusion based on a seed population. They use compression-code matching with low computational complexity to group a large number of user samples and search for the target crowd in units of groups, thereby reducing time complexity and addressing the technical problem described above.
According to a first aspect, there is provided a method for population diffusion based on seed population, comprising:
performing feature compression on the characterization vectors of the user samples in a crowd data set to obtain a compression code for each user sample, where the crowd data set includes a seed user set formed by seed user samples belonging to a target crowd;
dividing the crowd data set into a plurality of user groups according to the compression codes;
determining a discrimination index of each user group with respect to the seed user set; and
selecting a number of user groups according to the discrimination index of each user group, and classifying the user samples in the selected groups into the target crowd.
According to one implementation, the compression encoding is an M-bit encoding, and the feature compression includes:
performing an encoding operation on a first characterization vector of any first user sample to determine a corresponding first compression code, where the j-th bit element represents a first position relationship between the first characterization vector and the j-th hyperplane vector.
In one scenario of the above implementation, the first position relationship is a directed distance, the compression encoding is binary encoding, and the encoding operation includes:
if the directed distance between the first characterization vector and the j-th hyperplane vector is smaller than 0, encoding the j-th bit element as 0; and
if the directed distance between the first characterization vector and the j-th hyperplane vector is not smaller than 0, encoding the j-th bit element as 1.
According to one implementation, for any two user samples, a first distance between their compression encodings is inversely related to a first similarity between the token vectors.
In one scenario of the above implementation, the first distance is a hamming distance, and the first similarity is a cosine similarity.
According to one implementation, the dividing the crowd data set into user groups includes dividing user samples having the same compression coding into the same group.
According to one implementation, the determining the discrimination index of each user group with respect to the seed user set includes:
for any user group, determining a first number ratio, i.e., the proportion of the seed user set made up by the seed user samples contained in the group, and a second number ratio, i.e., the proportion of the crowd data set made up by the user samples contained in the group; and
determining a discrimination index of the user group that is positively correlated with the first number ratio and negatively correlated with the second number ratio.
According to one implementation, the determining a number of user groups, classifying the user samples therein into the target group, includes:
sorting the plurality of user groups by their respective discrimination indexes to obtain a first sequence; and
classifying, in order along the first sequence, the user samples in the current user group into the target crowd until a preset condition is reached.
In one scenario of the above implementation, the preset condition includes that the number of user samples in the target crowd reaches a preset first number, or that the number of processed user packets reaches a preset second number.
According to one implementation, the determining a number of user groups, classifying the user samples therein into the target group, includes:
and determining a plurality of user groups with the distinguishing index larger than a preset threshold value from the plurality of user groups.
And classifying the user samples in the user groups into the target crowd.
According to a second aspect, there is provided an apparatus for population diffusion based on seed populations, comprising:
The compression module is configured to perform feature compression on the characterization vectors of all user samples in the crowd data set to obtain compression codes corresponding to all user samples, wherein the crowd data set comprises a seed user set formed by seed user samples belonging to target crowd.
And the grouping module is configured to divide the crowd data set into a plurality of user groups according to the compression codes.
A calculation module configured to determine a discrimination indicator for each user group with respect to the seed user set.
And the determining module is configured to determine a plurality of user groups according to the distinguishing degree index of each user group and classify the user samples in the user groups into the target crowd.
According to a third aspect, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, characterized in that the memory has stored therein executable code, which when executed by the processor, implements the method of the first aspect.
In the method and apparatus provided by the embodiments of this specification, performing feature compression on the characterization vectors of a large number of user samples effectively reduces the storage space of the user feature data. Compression codes turn the characterization similarity calculation between user samples into code matching with lower computational complexity, and similar user samples are grouped together. Retrieval then proceeds in units of groups, which avoids a one-to-one brute-force search over a large number of user samples. Population diffusion efficiency is thereby markedly improved and the range of application is broadened.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a block diagram of a method for population diffusion based on seed populations disclosed in the embodiments of the present disclosure;
FIG. 2 is a flowchart of a method for population diffusion based on seed populations according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an apparatus for population diffusion based on seed population according to an embodiment of the present disclosure.
Detailed Description
The embodiments of the present invention will be described below with reference to the accompanying drawings.
As mentioned above, population diffusion is an important part of user analysis. As a common technique for user mining and crowd segmentation, its goal is to retrieve, from a large number of users and based on a known seed population, a larger target crowd that is similar or related to the seed population.
In real business scenarios (e.g., social network operations, e-commerce recommendation, advertisement delivery), it is often necessary to start from a seed population (e.g., users with a low social relationship degree to a certain user, users who purchased a certain commodity, users who clicked a certain advertisement) and find a larger target crowd with similar characteristics. A seed population with a small sample size is thereby expanded into a target crowd with more user samples, and precise marketing toward this target crowd can generally achieve an effect similar to that achieved on the seed population. This is of great significance for expanding marketing coverage and improving user conversion rates.
The features of each user sample can be expressed with a characterization vector, a compact and representative vector generated by processing the original features of the user sample. The original features may include demographic information, user behavior data, user interaction records, and so on, and are typically high-dimensional and sparse. Common representation learning methods such as word embedding and factorization can convert the original features into a multidimensional real-valued vector that serves as the characterization vector of the user sample. The similarity between user samples can then be determined by comparing their characterization vectors.
To implement population diffusion, one related technique computes the characterization similarity between user samples, sorts the similarities, and takes the Top-N user samples most similar to each seed user sample as the target crowd. To evaluate the full set of user samples in the crowd data set, this approach must compute the characterization similarity between every seed user sample and every user sample, and then perform a brute-force search over the full set. This works on a crowd data set with a small data volume, but once the total number of user samples reaches a certain scale (in an Internet service system, a crowd data set can contain hundreds of millions of user samples), one-to-one calculation and brute-force retrieval become increasingly slow and cannot meet real-time requirements.
On the one hand, the similarity between characterization vectors is usually measured by the distance or angle between them. For example, for two k-dimensional characterization vectors $A = (A_1, A_2, \dots, A_k)$ and $B = (B_1, B_2, \dots, B_k)$, the characterization similarity between them can be measured using the Euclidean distance:
$$d(A, B) = \lVert A - B \rVert \tag{1}$$
Expanded, this is:
$$d(A, B) = \sqrt{(A_1 - B_1)^2 + (A_2 - B_2)^2 + \dots + (A_k - B_k)^2} \tag{2}$$
That is, computing the characterization similarity between a pair of characterization vectors in this way requires k element-wise subtractions, k squarings (multiplications), and 1 square root.
The characterization similarity between them can also be measured using cosine similarity:
$$\cos(A, B) = \frac{A \cdot B}{|A|\,|B|} \tag{3}$$
where:
$$A \cdot B = A_1 B_1 + A_2 B_2 + \dots + A_k B_k \tag{4}$$
$$|A| = \sqrt{A_1^2 + A_2^2 + \dots + A_k^2} \tag{5}$$
$$|B| = \sqrt{B_1^2 + B_2^2 + \dots + B_k^2} \tag{6}$$
That is, computing the characterization similarity between a pair of characterization vectors in this way requires k element-wise multiplications for the dot product and 2k squarings (multiplications) to obtain the moduli of the two vectors.
Therefore, computing the similarity between characterization vectors requires a number of multiplications that grows with the vector dimension, so the computational complexity is high.
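The two similarity measures above can be sketched in plain Python to make the per-pair cost visible. This is only an illustration: the function names are hypothetical, and the comments count the operations in the same way as the text.

```python
import math

def euclidean_distance(a, b):
    # k subtractions, k squarings, then one square root (equation (2)).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # k multiplications for the dot product (equation (4)) ...
    dot = sum(x * y for x, y in zip(a, b))
    # ... plus 2k squarings for the two moduli (equations (5) and (6)).
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = [3.0, 4.0]
B = [6.0, 8.0]
print(euclidean_distance(A, B))  # distance between A and B
print(cosine_similarity(A, B))   # A and B point in the same direction
```

Every pair of vectors pays this cost again, which is the expense that the compression codes are meant to avoid.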
On the other hand, for a crowd data set with a large number of user samples, each seed user sample in the seed user set must be compared with each user sample in the crowd data set. This amounts to expanding the Cartesian product of the seed user set and the crowd data set, which brings a huge computational cost, with time complexity $O(n^2)$.
The above analysis shows that, on a crowd data set with a large number of user samples, performing population diffusion by brute-force retrieval over the full data with one-to-one characterization similarity calculations consumes a large amount of computing resources, has low retrieval efficiency, and cannot meet real-time requirements.
To solve the above problems, the embodiments of this specification provide a method and apparatus for population diffusion based on a seed population. First, feature compression is applied to the characterization vectors of the user samples to obtain compression codes that occupy less storage space and still correctly express the similarity relationships between user samples. Second, matching between compression codes replaces the characterization similarity calculation, which consumes a large amount of computing resources, so that a large number of user samples can be grouped by similarity. Third, the groups close to the seed population are determined by computing and comparing the discrimination indexes of the groups, which yields the target crowd and avoids a brute-force search over the full set of user samples.
FIG. 1 shows the framework of the method for population diffusion based on a seed population. In a crowd data set with a large number of user samples, some user samples can be designated as seed user samples according to business rules, forming the seed population. The goal of population diffusion is to retrieve, based on the seed population, the user samples in the crowd data set that are similar to the seed user samples and take them as the target crowd. First, feature compression is performed on each user sample in the crowd data set: the characterization vector of the user sample is compressed into a compression code that occupies less data space and still correctly reflects the similarity relationships between user samples. Then, the compression codes are matched, and user samples with the same or similar compression codes (depending on the precision required of the population diffusion) are placed in the same group. From the numbers of seed and non-seed user samples in each group, the sensitivity of the group to the seed population can be determined and used as the group's discrimination index. Finally, the groups are sorted by discrimination index, and user samples are taken from the sorted groups in order and classified into the target crowd, yielding a target crowd that meets business requirements.
Following the above technical idea, FIG. 2 shows a flowchart of a method for population diffusion based on a seed population according to an embodiment of the present disclosure. The method may be performed by any apparatus, device, platform, or device cluster with computing and processing capabilities. Referring to FIG. 2, in one embodiment, the method includes at least the following steps:
S201, performing feature compression on the characterization vectors of the user samples in the crowd data set to obtain a compression code for each user sample, where the crowd data set includes a seed user set formed by seed user samples belonging to the target crowd.
S203, dividing the crowd data set into a plurality of user groups according to the compression codes.
S205, determining a discrimination index of each user group with respect to the seed user set.
S207, selecting a number of user groups according to the discrimination index of each user group, and classifying the user samples in the selected groups into the target crowd.
Specific implementations of the above steps will be described in detail below with reference to the accompanying drawings.
In the crowd data set, a mass of user samples are contained, wherein part of the user samples belong to seed user samples, and the seed user samples form a seed user set. It should be noted that in some embodiments, the seed user sample may also be from a user data set other than the crowd data set, and the source of the seed user sample is not limited in this embodiment.
Based on the crowd data set, in step S201, feature compression is performed on the characterization vectors of the user samples in the crowd data set to obtain a compression code for each user sample. As described above, the characterization vector of a user sample may be obtained by representation learning on the original features of the user sample. This step compresses the characterization vectors so that the resulting compression codes have a smaller data volume and still correctly reflect the similarity between user samples.
In a specific implementation, various feature compression methods can be adopted, for example, dimension reduction can be achieved by projecting the characterization vector into a low-dimensional space, dimension reduction can be achieved on the characterization vector by performing principal component analysis on the characterization vector, retaining part of main feature information therein, and the like.
In one particular implementation, a hashing algorithm may be employed to create a corresponding digital "fingerprint" based on the token vector. The hash function may compress the token vector data such that the amount of data becomes smaller.
Assume that the characterization vector of a user sample is a d-dimensional real-valued vector, and denote the characterization vector of user sample i as $u_i \in \mathbb{R}^d$. The goal of feature compression is to compress the characterization vector into an M-bit compression code. In some typical practices, M is 20.
First, M d-dimensional hyperplane vectors $r_j \in \mathbb{R}^d$, $j = 1, 2, \dots, M$, are randomly acquired.
Then, an encoding operation is performed on the characterization vector $u_i$ of any user sample i to determine the corresponding compression code, in which the j-th bit element represents the position relationship between the characterization vector and the j-th hyperplane vector.
Through the above feature compression operation, the d-dimensional characterization vector is compressed into an M-bit compression code. Further, in one specific scenario, the characterization vector may be compressed into a binary code string: the position relationship is represented by the directed distance $s_{ij}$ between the characterization vector $u_i$ and the hyperplane $r_j$, and the binary code elements $e_{ij}$ of user sample i are generated bit by bit according to the sign of $s_{ij}$, forming its compression code:
$$e_{ij} = \mathbb{1}(s_{ij} \ge 0), \quad j = 1, 2, \dots, M \tag{8}$$
where $\mathbb{1}(\cdot)$ is the indicator function. The formula states that if the directed distance $s_{ij}$ between the characterization vector $u_i$ of user sample i and the j-th hyperplane is smaller than 0, the j-th bit element of the compression code is encoded as 0; otherwise, it is encoded as 1. This yields the compression code $e_i = (e_{i1}, e_{i2}, \dots, e_{iM}) \in \{0, 1\}^M$ of user sample i.
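The encoding operation can be sketched as follows. This is a minimal illustration under stated assumptions, not the source's implementation: the directed distance is assumed to be the dot product of the characterization vector with the hyperplane vector (only its sign is used), the hyperplane vectors are assumed to be drawn from a standard Gaussian, and the names `random_hyperplanes` and `compress` are hypothetical.

```python
import random

def random_hyperplanes(m, d, seed=0):
    # Assumption: each of the M hyperplane vectors has i.i.d. standard
    # Gaussian components; the source only says they are random.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def compress(u, hyperplanes):
    # Assumption: the directed distance s_ij is taken as the dot product
    # u_i . r_j. Bit j is 1 when s_ij >= 0 and 0 otherwise (equation (8)).
    bits = []
    for r in hyperplanes:
        s = sum(x * y for x, y in zip(u, r))
        bits.append(1 if s >= 0 else 0)
    return tuple(bits)

planes = random_hyperplanes(m=20, d=4)
code = compress([0.3, -1.2, 0.5, 2.0], planes)
print(len(code))  # one bit per hyperplane
```

Returning the code as a tuple keeps it hashable, so it can later serve directly as a grouping key.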
It should be understood that, in the above implementation, the element positions (j) of the compression code correspond one-to-one with the ordinals (j) of the hyperplanes against which the characterization vector is compared, but this correspondence does not limit the element positions of the compression code. In other implementations, the element positions of the compression code may be offset from the ordinals of the compared hyperplanes, provided that a definite mapping exists between them. For example, if a check bit is inserted before each element that represents a position relationship, the elements no longer sit at bit j, yet each of them still represents the position relationship between the characterization vector and the j-th hyperplane vector.
In this scenario, the high-dimensional real-valued characterization vector is mapped into a low-dimensional binary code space. For any two user samples, the higher their characterization similarity in the original d-dimensional space, the closer their compression codes are in the code space, and vice versa. That is, for any two user samples, the distance between their compression codes is inversely related to the similarity between their characterization vectors.
In one example, for any two user samples, the distance between their compression codes may be represented by the Hamming distance, and the similarity between their characterization vectors may be represented by the cosine similarity. That is, the Hamming distance between the compression codes obtained after feature compression can be used to measure the angle between the original characterization vectors. In other examples, other metrics of similarity between vectors and other metrics of similarity between codes may be used to relate the compression codes to the characterization vectors; they are not enumerated here.
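The link between Hamming distance and angle can be illustrated with a short sketch. It relies on a known property of sign codes from random hyperplanes: when the hyperplane vectors are drawn from a spherically symmetric distribution (an assumption here, since the source does not state the distribution), two vectors at angle θ disagree on any given bit with probability θ/π. The function names are hypothetical.

```python
import math

def hamming_distance(code_a, code_b):
    # Number of bit positions at which the two codes differ.
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(1 for x, y in zip(code_a, code_b) if x != y)

def estimated_angle(code_a, code_b):
    # The fraction of differing bits estimates theta / pi, so scaling by pi
    # gives an estimate of the angle between the original vectors.
    return math.pi * hamming_distance(code_a, code_b) / len(code_a)

e1 = (1, 0, 1, 1)
e2 = (1, 1, 0, 1)
print(hamming_distance(e1, e2))
print(estimated_angle(e1, e2))
```

Comparing two M-bit codes needs only M bit comparisons and no multiplications, and the estimate becomes less noisy as M grows.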
After the compression codes corresponding to the user samples in the crowd data set are obtained, the user samples can be divided into different groups according to the compression codes. Thus, in step S203, the crowd data set is divided into a plurality of user groups according to the compression encoding. Since the compression coding can correctly reflect the similarity between different user samples, user samples with similar characteristics can be identified according to the compression coding, and the user samples are divided into the same group.
According to one implementation, user samples with the same compression code may be placed in the same group. Let the set of binary compression codes be $C = \{c_1, c_2, \dots, c_{2^M}\} = \{0, 1\}^M$. Under the rule that each code value forms one group, up to $2^M$ different groups can be formed. The k-th group is denoted $B_k$ and contains all user samples whose compression code is $c_k$:
$$B_k = \{\, i \mid e_i = c_k \,\}, \quad k = 1, 2, \dots, 2^M \tag{9}$$
The feature compression operation gives user samples with similar original characterization vectors similar compression codes, and the grouping operation places similar user samples in the same group. In the subsequent target crowd retrieval, user samples then only need to be selected from one or more groups with similar characteristics, according to the retrieval precision the business requires. The calculation avoids pairwise characterization similarity computation across the full set of user samples; feature compression is performed once per user sample, so the time complexity falls from $O(n^2)$ to the linear level $O(n)$.
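Grouping by identical compression code amounts to a single pass with a hash map, as in the sketch below. The function name and the input layout (a dictionary from user identifier to code) are assumptions for illustration.

```python
from collections import defaultdict

def group_by_code(codes):
    """Group user ids by compression code.

    codes: dict mapping a user id to its compression code (a hashable
    value such as a tuple of bits).
    """
    groups = defaultdict(list)
    for user_id, code in codes.items():
        # One dictionary lookup per user sample, so the pass is linear in n.
        groups[code].append(user_id)
    return dict(groups)

codes = {"u1": (0, 1), "u2": (0, 1), "u3": (1, 1)}
print(group_by_code(codes))
```

Only codes that actually occur produce a group, so the number of materialized groups is at most the smaller of the number of user samples and $2^M$.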
Next, in step S205, a discrimination index with respect to the seed user set is calculated for each group. Since each group contains user samples with similar characteristics, the similarity between the user samples in a group and the seed user samples can be determined by evaluating how sensitively the group expresses the characteristics of the seed population. In general, when the user samples in a group occur with high frequency in the seed user set and with low frequency in the crowd data set, which contains a large number of non-seed user samples, the compression code corresponding to the group has a significant ability to distinguish the seed user set. Accordingly, the user samples in the group may be classified as part of the target crowd.
According to one implementation, for any user group, a first number ratio (the proportion of the seed user set made up by the seed user samples in the group) and a second number ratio (the proportion of the crowd data set made up by the user samples in the group) are first determined. A discrimination index of the user group is then determined that is positively correlated with the first number ratio and negatively correlated with the second number ratio.
Let the seed user set be $\mathcal{S}$ and the crowd data set be $\mathcal{U}$. The number ratios of the k-th group $B_k$ in the seed user set and in the crowd data set are:
First number ratio:
$$p_k = \frac{|B_k \cap \mathcal{S}|}{|\mathcal{S}|} \tag{10}$$
Second number ratio:
$$q_k = \frac{|B_k|}{|\mathcal{U}|} \tag{11}$$
where $k = 1, 2, \dots, 2^M$ and $|\cdot|$ denotes the cardinality of a set, i.e., its number of elements.
For any group $B_k$, its discrimination index $D_k$ with respect to the seed user set is defined so that it increases with the first number ratio and decreases with the second number ratio.
The larger $D_k$ is, the more frequently the user samples in group $B_k$ appear in the seed user set and the less frequently they appear in the crowd data set, and hence the higher the discrimination of group $B_k$ with respect to the seed user set.
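As a concrete illustration, one simple index with the stated monotonicity is the quotient of the first number ratio by the second. This particular form is an assumption chosen for the sketch; the text only requires positive correlation with the first ratio and negative correlation with the second. The function name is hypothetical.

```python
def discrimination_index(group, seed_set, population_size):
    """Quotient of the two number ratios for one group.

    group: list of user ids in the group
    seed_set: set of seed user ids
    population_size: number of user samples in the crowd data set
    """
    seeds_in_group = sum(1 for user in group if user in seed_set)
    first_ratio = seeds_in_group / len(seed_set)   # share of the seed user set
    second_ratio = len(group) / population_size    # share of the crowd data set
    # Assumed form: grows with first_ratio and shrinks with second_ratio.
    return first_ratio / second_ratio

group = ["a", "b", "c", "d", "e"]
seeds = {"a", "b", "x", "y"}
print(discrimination_index(group, seeds, population_size=50))
```

With this form, a value above 1 means seed users are over-represented in the group relative to the crowd data set as a whole.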
According to another implementation, the contribution of the group to the target crowd, i.e., the information value (IV) of the group, can also be used as the group's discrimination index with respect to the seed user set. This implementation can be summarized as follows:
Assume the same seed user set and crowd data set, and let the number ratios of the k-th group $B_k$ in the seed user set and in the crowd data set be the first number ratio of equation (10) and the second number ratio of equation (11), respectively.
Following the idea of information value, the discrimination index $D_k$ of any group $B_k$ with respect to the seed user set is computed from these two number ratios.
The larger $D_k$ is, the greater the contribution of group $B_k$ to the target crowd, and the higher its discrimination with respect to the seed user set.
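For reference, the per-bin term commonly used when computing information value is the difference of two proportions multiplied by the logarithm of their quotient. The sketch below applies that standard term to the first and second number ratios; whether the source's formula has exactly this form is an assumption, and the function name and the smoothing constant `eps` are hypothetical additions to avoid taking the logarithm of zero.

```python
import math

def iv_index(first_ratio, second_ratio, eps=1e-9):
    # Assumed form: (p - q) * ln(p / q), the usual information value term,
    # with p the first number ratio and q the second number ratio.
    p = first_ratio + eps
    q = second_ratio + eps
    return (p - q) * math.log(p / q)

print(iv_index(0.5, 0.1))  # seed users are over-represented in the group
print(iv_index(0.2, 0.2))  # group mirrors the crowd data set
```

This term is never negative and also grows when the first ratio falls far below the second, so a practical variant might additionally check the sign of the difference before treating a group as close to the seed population.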
After the above steps, the discrimination index of each group has been obtained. Next, in step S207, a number of user groups are selected according to the discrimination index of each user group, and the user samples in them are classified into the target crowd. In this step, a number of groups that meet business requirements can be determined from the discrimination indexes of the groups and from the similarity to the seed population that the business requires of the target crowd. The user samples in these groups are selected as the target crowd, which completes the population diffusion task based on the seed population.
According to one implementation, the groups may be sorted by their discrimination indexes, and the sorted groups, or the user samples in them, are then selected conditionally to meet business requirements. In summary, the plurality of user groups are sorted by their respective discrimination indexes to obtain a first sequence. Suppose the groups in the resulting first sequence are denoted, in order:
$$B_{(1)}, B_{(2)}, \dots, B_{(2^M)}$$
and that their discrimination indexes are $D_{(1)}, D_{(2)}, \dots, D_{(2^M)}$. The discrimination indexes of the groups then satisfy:
$$D_{(1)} \ge D_{(2)} \ge \dots \ge D_{(2^M)}$$
And then, sequentially classifying the user samples in the current user group into the target crowd along the first sequence until a preset condition is reached. In a specific scenario, the preset condition may be flexibly set according to the service requirement.
In one example, the preset condition may be that the number of user samples in the target crowd reaches a preset first number. Under this preset condition, the target crowd obtained by crowd diffusion can be expressed as:
Here, m is the first number set in the preset condition, that is, the minimum number of user samples required in the target crowd, and t is the minimum number of groups required to meet the preset condition:
In another example, the preset condition may be that the number of processed user groups reaches a preset second number. Under this preset condition, an expected number of groups can be set, and the user samples in the groups are classified into the target crowd in the order of the first sequence; when the number of processed groups reaches the expected number, the crowd diffusion task is complete.
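The ordered selection described above, under either preset condition, can be sketched as follows. This is a minimal sketch under stated assumptions: the first sequence is assumed to be in descending order of discrimination index, whole groups are added one at a time, and the names `select_by_order`, `min_samples` (the first number m) and `max_groups` (the second number) are hypothetical.

```python
def select_by_order(groups, scores, min_samples=None, max_groups=None):
    """Walk the groups in order of discrimination index and collect user samples.

    groups: dict mapping group key -> list of user ids.
    scores: dict mapping group key -> discrimination index.
    Returns the target crowd and the number of groups that were used.
    """
    # First sequence: groups sorted by discrimination index, largest first (assumed).
    first_sequence = sorted(groups, key=lambda g: scores[g], reverse=True)
    target, used = [], 0
    for g in first_sequence:
        # Stop as soon as either preset condition has been reached.
        if min_samples is not None and len(target) >= min_samples:
            break
        if max_groups is not None and used >= max_groups:
            break
        target.extend(groups[g])
        used += 1
    return target, used
```

When only `min_samples` is given, the returned group count plays the role of t, the smallest number of groups whose combined size reaches m.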
The above describes one implementation of selecting user samples for the target crowd. In some embodiments, according to another implementation, the groups that meet a condition derived from the business requirement are selected first, and then the user samples in the selected groups are classified into the target crowd. This implementation can be summarized as follows: several user groups whose discrimination index is greater than a preset threshold are determined from the plurality of user groups, and the user samples in those user groups are classified into the target crowd.
Under this preset condition, the target crowd obtained by crowd diffusion can be expressed as:
Here, n is the preset threshold set according to the business requirement, that is, the lowest value that the discrimination index of a selected group must exceed. In this way, the user samples in groups whose discrimination index exceeds the preset threshold, i.e., user samples sufficiently similar to the seed user samples, are included in the target crowd, which accomplishes the crowd diffusion task based on the seed crowd.
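The threshold-based selection can be sketched in a few lines. This is an illustrative sketch only; the function name `select_by_threshold` and the dictionary-based inputs are hypothetical, and `threshold` corresponds to the preset threshold n.

```python
def select_by_threshold(groups, scores, threshold):
    """Collect the user samples of every group whose index exceeds the threshold.

    groups: dict mapping group key -> list of user ids.
    scores: dict mapping group key -> discrimination index.
    """
    target = []
    for g, members in groups.items():
        # Strictly greater than the threshold, as stated in the text.
        if scores[g] > threshold:
            target.extend(members)
    return target
```

Unlike the ordered selection, this variant needs no sorting: each group is tested independently, so the size of the resulting target crowd is determined by the threshold rather than fixed in advance.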
The method for population diffusion based on a seed population according to one or more embodiments has been described in detail above. With the method provided by the embodiments of this specification, the characterization vectors of user samples are compressed during crowd diffusion to obtain compression codes that occupy less storage space, the user samples are assigned to different groups based on the compression codes, and the target crowd is retrieved group by group. The method effectively converts the original, computationally expensive similarity calculation on characterization vectors into simple and efficient code matching; at the same time, group-based retrieval can reduce the time complexity of crowd diffusion to a constant level, thereby improving its computational efficiency.
In this specification, the word "first" in terms such as "first user sample" and "first number", and the corresponding words "second", "third" (if any), etc., are used only for convenience of distinction and description and do not have any limiting meaning.
The foregoing describes certain embodiments of the present disclosure, other embodiments being within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. Furthermore, the processes depicted in the accompanying figures are not necessarily required to achieve the desired result in the particular order shown, or in a sequential order. In some embodiments, multitasking and parallel processing are also possible, or may be advantageous.
Fig. 3 is a schematic diagram of an apparatus for population diffusion based on seed population according to an embodiment of the present disclosure. The apparatus 300 is deployed in a computing device, which may be implemented by any apparatus, device, platform, or device cluster having computing and processing capabilities. This apparatus embodiment corresponds to the method embodiment shown in fig. 2. The apparatus 300 includes:
The compression module 301 is configured to perform feature compression on the characterization vector of each user sample in the crowd data set to obtain compression codes corresponding to each user sample, where the crowd data set includes a seed user set formed by seed user samples belonging to the target crowd.
A grouping module 302 is configured to divide the crowd data set into a plurality of user groups according to the compression encoding.
A calculation module 303 is configured to determine a discrimination indicator for each user group with respect to the seed user set.
The determining module 304 is configured to determine a plurality of user groups according to the distinguishing degree index of each user group, and classify the user samples therein into the target crowd.
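To make the division of labor among the four modules concrete, the following sketch chains them into one pipeline. It is an illustration under stated assumptions, not the implementation of apparatus 300: the hyperplanes are assumed to pass through the origin and to be drawn from a Gaussian distribution, the discrimination index is assumed to be the simple ratio of the first number ratio to the second, the final step uses the threshold variant, and all function and parameter names are hypothetical.

```python
import random
from collections import defaultdict

def compress(vectors, hyperplanes):
    """Compression module: one bit per hyperplane, 1 if the projection is >= 0, else 0."""
    codes = {}
    for uid, x in vectors.items():
        bits = []
        for w in hyperplanes:
            proj = sum(wi * xi for wi, xi in zip(w, x))
            bits.append("1" if proj >= 0 else "0")
        codes[uid] = "".join(bits)
    return codes

def group_by_code(codes):
    """Grouping module: samples with the same compression code share a group."""
    groups = defaultdict(list)
    for uid, code in codes.items():
        groups[code].append(uid)
    return dict(groups)

def discrimination(groups, seed_ids, n_crowd):
    """Calculation module: first number ratio divided by second (assumed form)."""
    seeds = set(seed_ids)
    scores = {}
    for code, members in groups.items():
        p = sum(u in seeds for u in members) / len(seeds)  # share of seed set
        q = len(members) / n_crowd                         # share of crowd data set
        scores[code] = p / q
    return scores

def diffuse(vectors, seed_ids, num_bits=2, threshold=1.0, rng_seed=0):
    """Run the four modules in order and return the sorted target crowd."""
    rng = random.Random(rng_seed)
    dim = len(next(iter(vectors.values())))
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
    codes = compress(vectors, hyperplanes)
    groups = group_by_code(codes)
    scores = discrimination(groups, seed_ids, len(vectors))
    # Determining module: keep groups whose index exceeds the threshold.
    return sorted(u for code, members in groups.items()
                  if scores[code] > threshold for u in members)
```

The ratio used in `discrimination` increases with the share of seed users in a group and decreases with the group's share of the whole crowd; any other index with the same behavior could be substituted without changing the rest of the pipeline.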
According to an embodiment of a further aspect, the present description also provides a computer program product comprising a computer program/instruction which, when executed by a processor, carries out the steps of the method described above in connection with fig. 2.
According to an embodiment of yet another aspect, the present disclosure further provides a computing device, including a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the steps of the method described above in connection with fig. 2.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing detailed description of the embodiments of the present invention further details the objects, technical solutions and advantageous effects of the embodiments of the present invention. It should be understood that the foregoing description is only specific to the embodiments of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements, etc. made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.
Claims (13)
1. A method for population diffusion based on seed population, comprising:
feature compression is carried out on the characterization vectors of all user samples in the crowd data set to obtain compression codes corresponding to all user samples, wherein the crowd data set comprises seed user sets formed by seed user samples belonging to target crowd;
dividing the crowd data set into a plurality of user groups according to the compression codes;
determining a distinguishing degree index of each user group about the seed user set;
And determining a plurality of user groups according to the distinguishing degree index of each user group, and classifying the user samples in the user groups into the target crowd.
2. The method of claim 1, wherein the compression encoding is an M-bit encoding, and wherein the feature compression comprises:
Obtaining M hyperplane vectors;
and performing an encoding operation on a first characterization vector of any first user sample to determine a corresponding first compression code, wherein the element of the j-th bit represents a first positional relationship between the first characterization vector and the j-th hyperplane vector.
3. The method of claim 2, wherein the first positional relationship is a directed distance, the compression encoding is binary encoding, and the encoding operation comprises:
if the directed distance between the first characterization vector and the j-th hyperplane vector is smaller than 0, encoding the j-th bit element as 0;
and if the directed distance between the first characterization vector and the j-th hyperplane vector is not smaller than 0, encoding the j-th bit element as 1.
4. The method of claim 1, wherein, for any two user samples, a first distance between their compression encodings is negatively correlated with a first similarity between their characterization vectors.
5. The method of claim 4, wherein the first distance is a hamming distance and the first similarity is a cosine similarity.
6. The method of claim 1, wherein the dividing the crowd data set into a plurality of user groups comprises dividing user samples having the same compression encoding into the same user group.
7. The method of claim 1, wherein the determining a discrimination indicator for each user group with respect to a seed user set comprises:
for any user group, determining a first number ratio, within the seed user set, of the seed user samples contained in the user group, and a second number ratio, within the crowd data set, of the user samples contained in the user group;
determining a discrimination index of the user group that is positively correlated with the first number ratio and negatively correlated with the second number ratio.
8. The method of claim 1, wherein the determining a plurality of user groups and classifying the user samples therein into the target crowd comprises:
sorting the plurality of user groups according to their respective discrimination indexes to obtain a first sequence;
and sequentially classifying the user samples in the current user group into the target crowd along the first sequence until a preset condition is reached.
9. The method of claim 8, wherein the preset condition comprises the number of user samples in the target crowd reaching a preset first number, or the number of processed user groups reaching a preset second number.
10. The method of claim 1, wherein the determining a plurality of user groups and classifying the user samples therein into the target crowd comprises:
determining a plurality of user groups with the distinguishing index larger than a preset threshold value from the plurality of user groups;
and classifying the user samples in the user groups into the target crowd.
11. A device for population diffusion based on seed population, comprising:
The compression module is configured to perform feature compression on the characterization vectors of all user samples in the crowd data set to obtain compression codes corresponding to all user samples, wherein the crowd data set comprises a seed user set formed by seed user samples belonging to target crowd;
A grouping module configured to divide the crowd data set into a plurality of user groups according to the compression encoding;
A computing module configured to determine a discrimination indicator for each user group with respect to the seed user set;
and the determining module is configured to determine a plurality of user groups according to the distinguishing degree index of each user group and classify the user samples in the user groups into the target crowd.
12. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1-10.
13. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411457046.6A CN119441250A (en) | 2024-10-16 | 2024-10-16 | Method and device for population diffusion based on seed population |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119441250A (en) | 2025-02-14 |
Family
ID=94504745
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411457046.6A Pending CN119441250A (en) | 2024-10-16 | 2024-10-16 | Method and device for population diffusion based on seed population |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119441250A (en) |
-
2024
- 2024-10-16 CN CN202411457046.6A patent/CN119441250A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication |