WO2023273081A1 - Clustering method, clustering apparatus, and non-transitory computer-readable storage medium
- Publication number: WO2023273081A1
- Application number: PCT/CN2021/128674 (CN2021128674W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- class
- intra
- similarities
- inter
- clustering
- Prior art date
Classifications
- G06F18/22—Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- G06F18/2321—Clustering techniques; Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques with fixed number of clusters, e.g. K-means clustering
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning; using clustering, e.g. of similar faces in social networks
- G06V10/763—Non-hierarchical techniques, e.g. based on statistics of modelling distributions
- G06V40/172—Human faces, e.g. facial parts, sketches or expressions; Classification, e.g. identification
Definitions
- the present disclosure relates to the field of image processing technology, in particular to a clustering method, a clustering apparatus, and a computer-readable non-transitory storage medium.
- an intelligent processing technology can be used to determine whether two images belong to the same clustering target (such as a person, a car or an animal) .
- a face recognition technology can be used to determine whether two face images belong to the same person
- a clustering technology can be used to determine which face images belong to the same person.
- the main problem of the existing image clustering technology is that, when faced with a large number of images with different scenarios and different attributes, it is inevitable that there are multiple clustering targets in one clustering file or one clustering target exists in multiple clustering files, that is, there is a problem of inaccurate clustering.
- the present disclosure provides a clustering method, a clustering apparatus, and a computer-readable non-transitory storage medium, which may improve the accuracy of image clustering.
- a clustering method includes: acquiring a plurality of first image files, each of the first image files comprises a plurality of to-be-clustered images, each of the to-be-clustered images has an attribute combination; calculating similarities among the to-be-clustered images of each of the first image files, recording the similarities as intra-class similarities, and constructing an intra-class similarity distribution based on the intra-class similarities; calculating similarities among to-be-clustered images with a same attribute combination in all of the first image files, recording the similarities as inter-class similarities, and constructing an inter-class similarity distribution based on the inter-class similarities; merging to-be-clustered images corresponding to attribute combinations satisfying a preset merging condition based on the intra-class similarity distribution and the inter-class similarity distribution, to obtain at least one second image file; and performing a clustering process on all of the at least one second image file, to obtain a clustering result.
- a clustering apparatus includes a memory and a processor connected to the memory.
- the memory is configured to store a computer program, and the computer program is configured to perform the clustering method in the above when being executed by the processor.
- a computer-readable non-transitory storage medium is provided and is configured to store a computer program.
- the computer program is configured to perform the clustering method in the above when being executed.
- a plurality of first image files are obtained first, and each of the to-be-clustered images in the first image files has an attribute combination.
- the to-be-clustered images in each of the first image files are processed to construct an intra-class similarity distribution and an inter-class similarity distribution of different attribute combinations.
- Subsequently, to-be-clustered images corresponding to attribute combinations satisfying a preset merging condition are merged based on the intra-class similarity distribution and the inter-class similarity distribution to generate at least one second image file.
- a clustering process is performed on the second image files to obtain a clustering result, such that an off-line image clustering is achieved. Since differences in attribute combinations of images are considered, differences in similarity distributions caused by different attribute combinations may be smoothed; the problem that different attribute combinations result in a poor recall or accuracy rate may be effectively alleviated, and the accuracy of clustering may be improved.
- FIG. 1 is a flow chart of a clustering method according to an embodiment of the present disclosure.
- FIG. 2 is a flow chart of a clustering method according to another embodiment of the present disclosure.
- FIG. 3 is a flow chart of an operation 210 in the embodiment shown in FIG. 2.
- FIG. 4 is a structural schematic view of a clustering apparatus according to an embodiment of the present disclosure.
- FIG. 5 is a structural schematic view of a computer-readable non-transitory storage medium according to an embodiment of the present disclosure.
- the main reason for the above problem is that the differences between the image similarity distributions in the files of different people are immense. For example, based on the actual data, the error rate of a file of a person such as an elder, a child or a female wearing a mask is higher compared to a file of an adult male. For these data, the difficulty of clustering them into one file should be increased, such that a false positive may be reduced. Some solutions propose that a dynamic threshold is adopted to improve the accuracy and recall rate, but distribution differences of face attribute information, as well as face similarities under different attributes, are not taken into account. In an actual clustering under a same threshold, the false positive rate of a child and a female wearing a mask is much higher than that of an adult male.
- a threshold set to be 92 points can be used to well divide a face of a male, while a threshold may need to be set to be 95 points or even higher for a child and a female wearing a mask, for the reason that a high similarity easily occurs among different children and among different females wearing a mask.
- when the threshold is set too high, a large number of pictures of an adult male that could be recalled may be lost, which in turn may cause an obvious problem of "one person corresponding to multiple files" .
- Another reason is that, due to factors such as an angle, a face quality or a pixel size, similarities of face images from two different people may be too high under a small probability, resulting in face images from two people appearing in one file. Moreover, with the impact of the low-quality image, more and more face images of the person corresponding to the low-quality image may be aggregated into the file, and originally one single noise image (that is, the face image of the person that the file does not correspond to) may be expanded into a plurality of noise images.
- a single noise image existing in a file is referred to as a single point noise; when there are multiple noise images in one file, the images are referred to as group noises.
- the attribute combination is a combination of attributes of the to-be-clustered images.
- the attribute combination includes a race, an age, a gender, whether wearing a mask, or whether wearing glasses.
- when the division is made based on the age attribute of a face, it may be divided into the elderly, the middle-aged, the teenager, and the child; based on the gender attribute of a face, it may be divided into male and female; when whether wearing a mask is taken into account as well, 16 kinds of combinations of attributes can be obtained; in response to a subdivision based on other face attributes, more attribute combinations can be obtained.
- to solve the similarity distribution difference problem among different attribute combinations, the present disclosure provides a method. That is, intra-class similarity distributions and inter-class similarity distributions of different attribute combinations are constructed firstly based on the annotated first image files; the similarity distributions (including intra-class similarity distributions and inter-class similarity distributions) of different attribute combinations are merged; and a clustering process is performed. With the similarity distributions of different attribute combinations combined, the problem that there are multiple clustering targets in one file can be effectively solved, and the problem that the same clustering target corresponds to multiple files can be alleviated to a certain extent.
- FIG. 1 is a flow chart of a clustering method according to an embodiment of the present disclosure. The method includes following operations.
- a plurality of first image files are acquired first, each of the first image files includes a plurality of to-be-clustered images, and each of the to-be-clustered images has an attribute combination.
- for different types of to-be-clustered images, the attribute combinations may be different.
- when the to-be-clustered images are face images, the attribute combinations include a race, an age, a gender, whether wearing a mask or whether wearing glasses; when the to-be-clustered images are license plate images, the attribute combinations may include a license plate shape, a license plate color or a license plate number.
- the first image files can be created by a manual annotation, and each of the to-be-clustered images is identified by a program developer to divide the to-be-clustered images belonging to the same category into the same group; as a result, a plurality of first image files are obtained. It is to be understood that other means may also be adopted to obtain first image files, such as directly obtaining the first image files from an image database, or taking a classification result of other clustering models as the first image files.
- each of the first image files also has an attribute combination, and the number of to-be-clustered images corresponding to different attribute combinations in each of the first image files can be counted.
- An attribute combination with the largest number is regarded as the attribute combination of each of the first image files.
- in the case where the to-be-clustered images are face images, for a first image file that has been annotated as the same person, an attribute combination of each face image in the file can be counted, and the attribute combination that appears the most times is regarded as the attribute combination of the file.
- similarities among the to-be-clustered images of each of the first image files are calculated, which are recorded as intra-class similarities, and an intra-class similarity distribution is constructed based on the intra-class similarities.
- to-be-clustered images in each of the first image files can be processed.
- features of the to-be-clustered images may be extracted by means of a feature extraction method, and similarities between the features of any two to-be-clustered images are calculated, which are recorded as intra-class similarities (that is, in the file) .
- An intra-class similarity distribution is constructed with all the intra-class similarities.
- similarities among to-be-clustered images with a same attribute combination in all of the first image files are calculated, which are recorded as inter-class similarities, and an inter-class similarity distribution is constructed based on the inter-class similarities.
- a merging process on all the first image files is performed based on attribute combinations, that is, at least two first image files with a same attribute combination are merged.
- a file generated after the merging process can be recorded as an attribute image file, and the to-be-clustered images in the attribute image file are processed. For example, features of to-be-clustered images are extracted, and similarities between the features of any two to-be-clustered images of the attribute image file are calculated, which are recorded as inter-class similarities (that is, among the files) .
- An inter-class similarity distribution is constructed with all the inter-class similarities.
- each attribute combination corresponds to one similarity distribution.
- when there are k (k ≥ 1) attribute combinations D1, D2, D3, ..., Dk, accordingly there are k intra-class similarity distributions F1, F2, F3, ..., Fk.
- the F1, F2, F3, ..., Fk represent intra-class similarities under different attribute combinations respectively.
- similarities among different files under a same attribute combination are calculated to obtain k inter-class similarity distributions F′1, F′2, F′3, ..., F′k.
- a preset similarity range, which may be [0, 1] or [0%, 100%] , may be divided into a preset number of subranges.
- the number of intra-class similarities falling in each subrange and the number of inter-class similarities falling in each subrange may be counted respectively, which are recorded as a first number and a second number accordingly.
- the first number is divided by a total number of the intra-class similarities to obtain a probability value corresponding to a corresponding one of the intra-class similarities, and an intra-class similarity distribution may be constructed based on all of the intra-class similarities and the corresponding probability values.
- the second number is divided by a total number of the inter-class similarities to obtain a probability value corresponding to a corresponding one of the inter-class similarities, and an inter-class similarity distribution may be constructed based on all of the inter-class similarities and the corresponding probability values.
- when there are M1 images in a first image file, similarities between any two to-be-clustered images may be calculated to obtain M2 intra-class similarities, and the probability value that each of the M2 intra-class similarities falls in each subrange is calculated. For instance, a probability value of falling in the interval (0.9, 1] may be 0.9 and a probability value of falling in the interval (0.8, 0.9] may be 0.2; thereby a corresponding relationship between the intra-class similarities and the probability values is obtained.
- in an operation 14, based on the intra-class similarity distribution and the inter-class similarity distribution, a merging process for to-be-clustered images corresponding to the attribute combinations that satisfy a preset merging condition is performed to obtain at least one second image file.
- the preset merging condition is a pre-determined condition to determine whether two attribute combinations can be merged. After the intra-class similarity distribution and the inter-class similarity distribution are obtained, the to-be-clustered images corresponding to the attribute combination with a similar or same similarity distribution can be merged together based on the intra-class similarity distribution and the inter-class similarity distribution, to obtain at least one second image file. It can be seen that the number of the second image files is less than the number of the first image files.
- in an operation 15, a clustering process on all of the second image files is performed to obtain a clustering result.
- a clustering process is performed on at least one of the second image files by means of a clustering method so as to obtain a corresponding clustering result, and accordingly the clustering process is completed.
- one clustering method may be adopted for clustering once, or multiple clustering methods are adopted for clustering; for example, one clustering method (such as a hierarchical clustering method or a density clustering method) is adopted firstly, and the obtained clustering categories are regarded as initial clustering categories of a clustering process for the second time; another clustering method (such as the k-means clustering method) is adopted for the second-time clustering in the following processing.
- the clustering method provided by the present embodiment can be applied to the technical field of face recognition when the to-be-clustered images are face images.
- the present embodiment provides a clustering method based on similarity distributions of attribute combinations of the to-be-clustered images. Similarity distributions of different attribute combinations are constructed firstly (that is, a mapping relationship between the attribute combinations and the similarity distributions is constructed firstly) ; to-be-clustered images corresponding to the attribute combinations with a similar or same similarity distribution are merged together to generate at least one second image file; subsequently, a clustering process on all of the second image files is performed to obtain a clustering result, such that an offline image clustering is achieved. Since the attribute combination of an image is considered, the difference in similarity distributions caused by different attribute combinations may be smoothed; the problem that different attribute combinations result in a poor recall or accuracy rate may be effectively alleviated, and the accuracy and recall rate of clustering may be improved.
- FIG. 2 is a flow chart of a clustering method according to another embodiment of the present disclosure. The method includes following operations.
- similarities among the to-be-clustered images of each of the first image files are calculated, which are recorded as intra-class similarities, and an intra-class similarity distribution is constructed based on the intra-class similarities.
- the operations 201 to 203 are the same as the operations 11 to 13 in the embodiment described above, and will not be repeated hereinafter.
- two attribute combinations are selected from all attribute combinations as a first attribute combination and a second attribute combination.
- the intra-class similarities and the inter-class similarities of the first attribute combination are obtained, and the intra-class similarities and the inter-class similarities of the second attribute combination are obtained.
- Each attribute combination has a corresponding intra-class similarity distribution and a corresponding inter-class similarity distribution. Different attribute combinations may have a similar similarity distribution. In order to determine whether the similarity distributions corresponding to the attribute combinations are similar, similarities of similarity distributions of any two attribute combinations can be compared. Further, the first attribute combination and the second attribute combination may be selected in an order of random selection, or a sequence identifier is created for each attribute combination, and the first attribute combination and the second attribute combination are selected in turn based on the sequence identifier.
- similarities between the intra-class similarities of the first attribute combination and those of the second attribute combination are calculated, to obtain a first distribution similarity; similarities between the inter-class similarities of the first attribute combination and those of the second attribute combination are calculated, to obtain a second distribution similarity.
- similarities of the intra-class similarity distribution and those of the inter-class similarity distribution can be compared respectively, that is, similarities between the intra-class similarities of the first attribute combination and those of the second attribute combination are calculated; similarities between the inter-class similarities of the first attribute combination and those of the second attribute combination are calculated.
- in an operation 206, it is determined whether the first distribution similarity and the second distribution similarity satisfy a preset merging condition.
- a distribution similarity measurement method may be adopted to determine whether the two distribution similarities satisfy the preset merging condition, such as a KL (Kullback-Leibler) divergence, an F-divergence or a Wasserstein distance.
- the KL divergence is used to measure the difference between the two distributions, that is, the first distribution similarity includes a first intra-class divergence and a second intra-class divergence; the second distribution similarity includes a first inter-class divergence and a second inter-class divergence.
- a KL divergence of the intra-class similarities of the first attribute combination relative to those of the second attribute combination is calculated firstly to obtain the first intra-class divergence.
- a KL divergence of the intra-class similarities of the second attribute combination relative to those of the first attribute combination is calculated to obtain the second intra-class divergence.
- a KL divergence of the inter-class similarities of the first attribute combination relative to those of the second attribute combination is calculated to obtain the first inter-class divergence;
- a KL divergence of the inter-class similarities of the second attribute combination relative to those of the first attribute combination is calculated to obtain the second inter-class divergence.
- F i represents the intra-class similarities of the first attribute combination, F j represents the intra-class similarities of the second attribute combination, F′ i represents the inter-class similarities of the first attribute combination, and F′ j represents the inter-class similarities of the second attribute combination.
- threshold KL represents the first preset value, threshold′ KL represents the second preset value, and both the first preset value and the second preset value are thresholds that are predetermined.
- KL (F i ‖F j ) represents the first intra-class divergence, KL (F j ‖F i ) represents the second intra-class divergence, KL (F′ i ‖F′ j ) represents the first inter-class divergence, and KL (F′ j ‖F′ i ) represents the second inter-class divergence. A sketch of such a merging check is given below.
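A minimal sketch of this merging check, assuming the distributions are the discrete histograms described earlier and assuming the condition requires all four divergences to fall below their respective preset values (the exact combination rule is not fully spelled out here); the epsilon smoothing is an implementation convenience:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence KL(p || q) between two histogram distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def satisfies_merging_condition(F_i, F_j, Fp_i, Fp_j, threshold_kl, threshold_kl_prime):
    """Check the preset merging condition for two attribute combinations.

    F_i / F_j are intra-class distributions and Fp_i / Fp_j inter-class
    distributions; requiring every divergence to be below its threshold is an
    assumed reading of the condition.
    """
    intra_ok = (kl_divergence(F_i, F_j) < threshold_kl and
                kl_divergence(F_j, F_i) < threshold_kl)
    inter_ok = (kl_divergence(Fp_i, Fp_j) < threshold_kl_prime and
                kl_divergence(Fp_j, Fp_i) < threshold_kl_prime)
    return intra_ok and inter_ok
```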
- the first-time clustering can be performed first to determine whether there are group noises in the file.
- the second time clustering can be performed to filter the group noises.
- an abnormal noise detection can be performed to filter a single point noise, specifically, as shown in operations 208-212.
- the second image file is regarded as a current to-be-clustered file.
- a first preset clustering algorithm is adopted to perform a clustering process on the current to-be-clustered file to obtain a plurality of third image files.
- the first preset clustering algorithm may be adopted for the first-time clustering.
- the first preset clustering algorithm may be a density clustering algorithm, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) , Incremental Grid Density-Based Clustering Algorithm (IGDCA) , Ordering Points To Identify the Clustering Structure (OPTICS) , the Largest Set of Non-Cored Core Points (LSNCCP) , and so on.
- the DBSCAN density clustering algorithm is adopted to generate the first-time clustering files, as sketched below.
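A sketch of the first-time clustering with scikit-learn's DBSCAN; the eps and min_samples values and the cosine metric are illustrative assumptions, not parameters given in the disclosure:

```python
from sklearn.cluster import DBSCAN

def first_stage_clustering(features, eps=0.3, min_samples=5):
    """Cluster the feature vectors of a current to-be-clustered file.

    `features` is an (M, d) array of image features. Returns a mapping from
    cluster label to image indices (one third image file per label); label -1
    collects the points DBSCAN treats as noise.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(features)
    third_image_files = {}
    for idx, label in enumerate(labels):
        third_image_files.setdefault(int(label), []).append(idx)
    return third_image_files
```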
- in an operation 210, it is determined whether a preset clustering termination condition is met, based on the intra-class similarity distribution of each of the third image files and the inter-class similarity distribution of each of the third image files.
- an attribute combination of a third image file is obtained, which is recorded as a current attribute combination.
- the number of to-be-clustered images corresponding to each attribute combination in each of the third image files is counted. For example, when the number of to-be-clustered images in a third image file is N, there are k attribute combinations D1, D2, D3, ..., Dk, and the numbers of images corresponding to the attribute combinations are N1, N2, N3, ..., Nk respectively. When Ni is the maximum value among N1, N2, N3, ..., Nk, Di is regarded as the attribute combination of the third image file.
- the intra-class similarity distribution F i and the inter-class similarity distribution F′ i of the current attribute combination are obtained.
- an intra-class probability and an inter-class probability are calculated based on the intra-class similarity distribution and the inter-class similarity distribution of the current attribute combination.
- the intra-class similarity distribution is constructed based on the intra-class similarities of the attribute combination and the corresponding probability values. Probability values corresponding to all of the intra-class similarities of the current attribute combination are multiplied to obtain the intra-class probability.
- the inter-class similarity distribution is constructed based on the inter-class similarities of the attribute combination and the corresponding probability values. Probability values corresponding to all of the inter-class similarities of the current attribute combination are multiplied to obtain the inter-class probability.
- for a similarity p, the probabilities that the similarity corresponds to in the intra-class similarity distribution and in the inter-class similarity distribution can be represented as F i (p) and F′ i (p) respectively. Since the number of images in a file is N, there are N (N-1) /2 pairs of to-be-clustered images, which means there are N (N-1) /2 similarities.
- an occurrence probability of the set of similarities can be calculated based on an occurrence probability of a single similarity; that is, a probability that all of the to-be-clustered images in the file belong to a same clustering target (that is, the intra-class probability P) is represented by the product of F i (p) over all of the similarities, while a probability that all of the to-be-clustered images in the file belong to different clustering targets (that is, the inter-class probability P′) is represented by the product of F′ i (p) over all of the similarities.
- the smaller the intra-class probability P is, the greater the possibility of group noises existing in the file is.
- the greater the inter-class probability P′ is, the greater the possibility of group noises existing in the file is.
- in an operation 34, it is determined whether the intra-class probability and the inter-class probability satisfy the preset clustering termination condition.
- the first preset probability threshold and the second preset probability threshold are probability thresholds that are predetermined. It is determined that there are group noises in the current to-be-clustered file when the intra-class probability is less than the first preset probability threshold or the inter-class probability is greater than the second preset probability threshold. A sketch of this check is given below.
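A sketch of the intra-class and inter-class probability check, assuming the distributions are the histogram arrays with bin edges built earlier; the bin lookup is an implementation detail, and in practice the product of many probabilities would usually be computed in log space to avoid underflow:

```python
import numpy as np

def pair_probability(similarities, distribution, edges):
    """Product of distribution probabilities over all pairwise similarities of a file."""
    bins = np.clip(np.digitize(similarities, edges[1:-1]), 0, len(distribution) - 1)
    return float(np.prod(np.asarray(distribution)[bins]))

def termination_condition_met(similarities, F_i, Fp_i, edges, p_threshold, p_prime_threshold):
    """Assumed reading of operation 34: the condition is met when the intra-class
    probability is at least the first threshold and the inter-class probability is
    at most the second threshold (no group noises suspected)."""
    p_intra = pair_probability(similarities, F_i, edges)
    p_inter = pair_probability(similarities, Fp_i, edges)
    return p_intra >= p_threshold and p_inter <= p_prime_threshold
```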
- the second-time clustering is required for the file with group noises.
- a second preset clustering algorithm is the K-means clustering algorithm.
- a clustering process on the third image file is performed by the K-means clustering algorithm.
- a filtering process is performed on the single point noise. Understandably, the single point noise may also be filtered directly without a detection, that is, the operation 212 is performed.
- when the preset clustering termination condition is not met, it is indicated that the possibility that there are group noises in a third image file is large.
- since the file is a result of the density clustering being performed once, the number of group noises tends not to be large, at which time the second preset clustering algorithm can be adopted to perform a clustering process on the third image files, to obtain a plurality of fourth image files.
- Each of the fourth image files is regarded as a current to-be-clustered file, and the first preset clustering algorithm is adopted to perform a clustering process on the current to-be-clustered files, that is, the operation is returned to the operation 209.
- for face images, there may be a "multiple people in one file" situation in the file after the density clustering is performed once. Considering the probability of the false positive, the number of different people in one file generally tends not to exceed three.
- a parameter k of the K-means is set to be 2 (that is, the number of the fourth image files is 2) , so as to ensure that two purer files are generated after the K-means clustering.
- a file A of face images is taken as an example in the following for explanation.
- the file A generates a file A1 and a file A2 after the K-means clustering.
- a probability that the to-be-clustered images in the file A correspond to the same one person is P A ;
- a probability that the to-be-clustered images in the file A correspond to different people is P′ A .
- An operation 212 is ready to be performed when the intra-class probability and the inter-class probability of the file A1 satisfy the preset clustering termination condition.
- otherwise, the operation is returned to the operation 209 again to perform an iteration process.
- the file A2 is operated in the same way. In this way, it may be ensured that all files are relatively pure when the operation 212 is going to be performed, and the possibility that there are group noises existing in the files is already quite small. A sketch of the k = 2 splitting used here is given below.
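A minimal sketch of the second-time clustering step, splitting a file suspected of group noises into two purer files with K-means (k = 2), as in the file A to A1 / A2 example; each returned part would then be treated as a current to-be-clustered file and passed back to operation 209:

```python
import numpy as np
from sklearn.cluster import KMeans

def split_suspect_file(features):
    """Split a third image file suspected of group noises into two fourth image files.

    `features` is an (N, d) array of the file's image features; k = 2 follows the
    description above, and the random_state / n_init values are illustrative.
    """
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    file_a1 = np.where(labels == 0)[0].tolist()
    file_a2 = np.where(labels == 1)[0].tolist()
    return file_a1, file_a2
```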
- in the operation 212, a filtering process of a single point noise is performed.
- when a filtering operation of the single point noise is ready to be performed, it means that all files are high-probability files, but there is still a possibility that single point noises exist in the files.
- a single point noise may be caused by multiple problems such as a face angle, face occlusion, an image pixel or image quality.
- the single point noise is filtered based on the following two solutions in the present embodiment.
- an intra-class similarity distribution of a third image file is calculated, which is recorded as a first intra-class similarity distribution. Similarities between each of the to-be-clustered images of the third image file and the other images of the third image file are calculated, and an intra-class similarity distribution is calculated based on the similarities, which is recorded as a second intra-class similarity distribution. Similarities between the first intra-class similarity distribution and the second intra-class similarity distribution are calculated, which are recorded as distribution similarities. A to-be-clustered image that corresponds to a maximum value of the distribution similarities is regarded as a single point noise, and the single point noise is deleted from the third image file.
- similarities of the single point noise present a different distribution from similarities of the other points in the file.
- assuming that the intra-class similarity distribution of the file is F and there are N to-be-clustered images in the file,
- the similarities between each of the to-be-clustered images and the other to-be-clustered images are calculated, and (N-1) similarities can be obtained.
- the (N-1) similarities constitute one intra-class similarity distribution.
- one intra-class similarity distribution of each of the to-be-clustered images can be obtained, and finally the intra-class similarity distributions F1, F2, F3, ..., FN are obtained.
- for each of the intra-class similarity distributions, differences between the intra-class similarity distribution and the whole intra-class similarity distribution F of the file are calculated.
- the KL divergence is used to measure the difference in the intra-class similarity distributions.
- a to-be-clustered image with the largest intra-class similarity distribution difference is filtered as a single point noise, as sketched below.
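A sketch of this first solution, assuming the distributions F and F1, ..., FN are histograms over [0, 1] and the KL divergence is the difference measure; the bin count and epsilon smoothing are assumptions:

```python
import numpy as np

def single_point_noise_index(pairwise_sim, num_bins=10, eps=1e-12):
    """Return the index of the most likely single point noise in a file.

    `pairwise_sim` is an (N, N) symmetric similarity matrix of the file's
    to-be-clustered images; each image's (N-1) similarities form its own
    distribution, which is compared against the whole-file distribution F.
    """
    n = pairwise_sim.shape[0]
    off_diag = pairwise_sim[~np.eye(n, dtype=bool)]
    F, edges = np.histogram(off_diag, bins=num_bins, range=(0.0, 1.0))
    F = F / F.sum()
    divergences = []
    for i in range(n):
        row = np.delete(pairwise_sim[i], i)       # the (N-1) similarities of image i
        Fi, _ = np.histogram(row, bins=edges)
        Fi = Fi / Fi.sum()
        p, q = Fi + eps, F + eps
        divergences.append(float(np.sum(p * np.log(p / q))))
    return int(np.argmax(divergences))
```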
- An abnormal value detection is also known as an outlier detection.
- an outlier detection algorithm is adopted to perform a filtering process on the third image file to filter out the single point noise in the third image file.
- the outlier detection is a detection process to find objects whose behaviors are quite different from the expected ones.
- the objects are referred to as abnormal points or outliers.
- the outlier detection algorithm has specific applications in real life, such as a credit card fraud detection, an industrial loss detection or an image detection.
- the outlier detection algorithm can be adopted to detect the single point noise in image clustering; for example, an isolation forest algorithm or a random cut forest (RCF) algorithm may be adopted to filter the single point noise, as sketched below.
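A sketch of this second solution using scikit-learn's IsolationForest as the outlier detection algorithm; the contamination value is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def filter_outlier_noise(features, contamination=0.05):
    """Filter single point noises from a third image file by outlier detection.

    `features` is an (N, d) array of image features; IsolationForest marks
    outliers with -1, and the kept indices form the filtered file.
    """
    predictions = IsolationForest(contamination=contamination, random_state=0).fit_predict(features)
    kept = np.where(predictions == 1)[0].tolist()
    noises = np.where(predictions == -1)[0].tolist()
    return kept, noises
```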
- the intra-class probability and the inter-class probability of the file are determined by virtue of the probabilities of similarities, and a possibility that a noise exists in a file may be further determined.
- for group noises, a manner of combining the DBSCAN density clustering with the K-means clustering is adopted.
- the high recall rate of DBSCAN and the K-means iteration clustering may ensure the filtering for group noises, such that the group noises of the file are filtered out as much as possible.
- the similarity distribution difference and the outlier detection algorithm are introduced to detect the single point noise to further obtain a purer file, and the problem that multiple clustering targets exist in one file may be alleviated effectively.
- the present disclosure provides a clustering method. Intra-class similarity distributions and inter-class similarity distributions of different attribute combinations are constructed firstly based on the annotated files. The similarity distributions of different attribute combinations are classified by virtue of the KL divergence.
- the first-time clustering is completed by the density clustering algorithm; subsequently the attribute combinations of all images in the file are counted to determine a whole attribute combination of the file; based on the similarity distributions corresponding to the attribute combinations and the similarities in the file, a probability that all images of the file correspond to the same person and a probability that all images of the file do not correspond to the same person are obtained.
- the K-means clustering is performed on the files whose probabilities fail to satisfy the preset clustering termination condition, such that group noises may be effectively filtered out.
- the single point noise in each of the files can be filtered based on the similarity distribution difference of the single point noise or the outlier detection algorithm.
- with the similarity distributions of different attribute combinations combined and two-layer noise filtering adopted, the problem of "multiple people in one file" may be effectively solved and the problem of "one person corresponding to multiple files" may be alleviated to a certain extent.
- FIG. 4 is a structural schematic view of a clustering apparatus according to an embodiment of the present disclosure.
- the clustering apparatus 40 includes a memory 41 and a processor 42 connected to the memory 41.
- the memory 41 is configured to store a computer program.
- the computer program is configured to implement the clustering method in the above embodiment when being executed by the processor 42.
- FIG. 5 is a structural schematic view of a computer-readable non-transitory storage medium according to an embodiment of the present disclosure.
- the computer-readable non-transitory storage medium 50 is configured to store a computer program 51.
- the computer program 51 is configured to implement the clustering method in the above embodiment when being executed.
- the computer-readable non-transitory storage medium 50 may be a server, a universal serial bus disk, a mobile hard drive, a read-only memory (ROM) , a random access memory (RAM) , a magnetic disk, an optical disk, or other media that can store program codes.
- the disclosed methods and the apparatus can be implemented by other means.
- the apparatus in the above described embodiments is merely exemplary.
- modules or units are divided based on logical functions only but can be divided in another way practically.
- multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
- a component displayed as a unit may or may not be a physical unit. That is, the component may be located in one place or distributed to multiple network units. A part of or the entire unit can be selected according to practical needs to achieve the purpose of the present disclosure.
- each functional unit in each implementation of the present disclosure can be integrated in one single processing unit or physically present separately.
- two or more units can be integrated in one single unit.
- the above integrated unit can be implemented either in a form of hardware or in a form of software functional units.
Abstract
A clustering method, a clustering apparatus and a computer-readable storage medium are provided. The method includes: acquiring a number of first image files each including multiple to-be-clustered images, each image having an attribute combination; calculating similarities among the images in each of the first image files as intra-class similarities, and constructing an intra-class similarity distribution based on the intra-class similarities; calculating similarities among the images with a same attribute combination in all the first image files as inter-class similarities, and constructing an inter-class similarity distribution based on the inter-class similarities; merging the images corresponding to the attribute combinations satisfying a preset merging condition based on the intra-class similarity distribution and the inter-class similarity distribution, to obtain at least one second image file; and clustering all the second image files to obtain a clustering result. Thus, the accuracy of image clustering is improved.
Description
The present disclosure relates to the field of image processing technology, in particular to a clustering method, a clustering apparatus, and a computer-readable non-transitory storage medium.
Currently, an intelligent processing technology can be used to determine whether two images belong to the same clustering target (such as a person, a car or an animal) . For example, a face recognition technology can be used to determine whether two face images belong to the same person, and a clustering technology can be used to determine which face images belong to the same person. However, the main problem of the existing image clustering technology is that, when faced with a large number of images with different scenarios and different attributes, it is inevitable that there are multiple clustering targets in one clustering file or one clustering target exists in multiple clustering files, that is, there is a problem of inaccurate clustering.
SUMMARY OF THE DISCLOSURE
The present disclosure provides a clustering method, a clustering apparatus, and a computer-readable non-transitory storage medium, which may improve the accuracy of image clustering.
According to a first aspect, a clustering method is provided and includes: acquiring a plurality of first image files, each of the first image files comprises a plurality of to-be-clustered images, each of the to-be-clustered images has an attribute combination; calculating similarities among the to-be-clustered images of each of the first image files, recording the similarities as intra-class similarities, and constructing an intra-class similarity distribution based on the intra-class similarities; calculating similarities among to-be-clustered images with a same attribute combination in all of the first image files, recording the similarities as inter-class similarities, and constructing an inter-class similarity distribution based on the inter-class similarities; merging to-be-clustered images corresponding to attribute combinations satisfying a preset merging condition based on the intra-class similarity distribution and the inter-class similarity distribution, to obtain at least one second image file; and performing a clustering process on all of the at least one second image file, to obtain a clustering result.
According to a second aspect, a clustering apparatus is provided and includes a memory and a processor connected to the memory. The memory is configured to store a computer program, and the computer program is configured to perform the clustering method in the above when being executed by the processor.
According to a third aspect, a computer-readable non-transitory storage medium is provided and is configured to store a computer program. The computer program is configured to perform the clustering method in the above when being executed.
According to the present disclosure, a plurality of first image files are obtained first, and each of the to-be-clustered images in the first image files has an attribute combination. The to-be-clustered images in each of the first image files are processed to construct an intra-class similarity distribution and an inter-class similarity distribution of different attribute combinations. Subsequently, to-be-clustered images corresponding to attribute combinations satisfying a preset merging condition are merged based on the intra-class similarity distribution and the inter-class similarity distribution to generate at least one second image file. A clustering process is performed on the second image files to obtain a clustering result, such that an off-line image clustering is achieved. Since differences in attribute combinations of images are considered, differences in similarity distributions caused by different attribute combinations may be smoothed; the problem that different attribute combinations result in a poor recall or accuracy rate may be effectively alleviated, and the accuracy of clustering may be improved.
In order to more clearly describe the technical solutions in the embodiments of the present disclosure, the following will briefly introduce the drawings required in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative work.
FIG. 1 is a flow chart of a clustering method according to an embodiment of the present disclosure.
FIG. 2 is a flow chart of a clustering method according to another embodiment of the present disclosure.
FIG. 3 is a flow chart of an operation 210 in the embodiment shown in FIG. 2.
FIG. 4 is a structural schematic view of a clustering apparatus according to an embodiment of the present disclosure.
FIG. 5 is a structural schematic view of a computer-readable non-transitory storage medium according to an embodiment of the present disclosure.
The technical solutions in the embodiments of the present disclosure will be clearly and comprehensively described in the following by referring to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only a part of, but not all of, the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without making creative work shall fall within the scope of the present disclosure.
In the art, a large number of face images can be clustered by virtue of a face clustering technology, and face photographs belonging to the same person can be classified into one category, that is, "one person corresponding to only one file" , which is an ideal situation. However, in fact, there are many situations of "one person corresponding to multiple files" (that is, the face photographs of one person exist in multiple files) or "multiple people in one file" (that is, there may be multiple face photographs from different people in one file) in the clustering process.
The main reason for the above problem is that the differences between the image similarity distributions in the files of different people are immense. For example, based on the actual data, the error rate of a file of a person such as an elder, a child or a female wearing a mask is higher compared to a file of an adult male. For these data, the difficulty of clustering them into one file should be increased, such that a false positive may be reduced. Some solutions propose that a dynamic threshold is adopted to improve the accuracy and recall rate, but distribution differences of face attribute information, as well as face similarities under different attributes, are not taken into account. In an actual clustering under a same threshold, the false positive rate of a child and a female wearing a mask is much higher than that of an adult male. In general, a threshold set to be 92 points can be used to well divide a face of a male, while a threshold may need to be set to be 95 points or even higher for a child and a female wearing a mask, for the reason that a high similarity easily occurs among different children and among different females wearing a mask. However, when the threshold is set too high, a large number of pictures of an adult male that could be recalled may be lost, which in turn may cause an obvious problem of "one person corresponding to multiple files" .
Another reason is that, due to factors such as an angle, a face quality or a pixel size, similarities of face images from two different people may be too high under a small probability, resulting in face images from two people appearing in one file. Moreover, with the impact of the low-quality image, more and more face images of the person corresponding to the low-quality image may be aggregated into the file, and originally one single noise image (that is, the face image of the person that the file does not correspond to) may be expanded into a plurality of noise images. For the convenience of description, a single noise image existing in a file is referred to as a single point noise; when there are multiple noise images in one file, the images are referred to as group noises.
The concept of the attribute combination adopted in the present disclosure is described in the following. The attribute combination is a combination of attributes of the to-be-clustered images. Taking the to-be-clustered images as face images for example, the attribute combination includes a race, an age, a gender, whether wearing a mask, or whether wearing glasses. Further, when the division is made based on the age attribute of a face, it may be divided into the elderly, the middle-aged, the teenager, and the child; based on the gender attribute of a face, it may be divided into male and female; when whether wearing a mask is taken into account as well, 16 kinds of combinations of attributes can be obtained; in response to a subdivision based on other face attributes, more attribute combinations can be obtained.
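As a small illustration of the combinatorics in the example above (4 age groups x 2 genders x 2 mask states = 16 combinations); the attribute names and granularity here are only the ones named in this paragraph:

```python
from itertools import product

ages = ["elderly", "middle-aged", "teenager", "child"]
genders = ["male", "female"]
masks = ["mask", "no_mask"]

# Enumerate every attribute combination considered in the example.
attribute_combinations = list(product(ages, genders, masks))
print(len(attribute_combinations))  # 16
```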
In order to solve the similarity distribution difference problem among different attribute combinations, the present disclosure provides a method. That is, intra-class similarity distributions and inter-class similarity distributions of different attribute combinations are constructed firstly based on the annotated first image files; the similarity distributions (including intra-class similarity distributions and inter-class similarity distributions) of different attribute combinations are merged; and a clustering process is performed. With the similarity distributions of different attribute combinations combined, the problem that there are multiple clustering targets in one file can be effectively solved, and the problem that the same clustering target corresponds to multiple files can be alleviated to a certain extent.
Referring to FIG. 1, FIG. 1 is a flow chart of a clustering method according to an embodiment of the present disclosure. The method includes following operations.
In an operation 11: a plurality of first image files are acquired.
A plurality of first image files are acquired first, each of the first image files includes a plurality of to-be-clustered images, and each of the to-be-clustered images has an attribute combination. For different types of to-be-clustered images, the attribute combinations may be different. For example, when the to-be-clustered images are face images, the attribute combinations include a race, an age, a gender, whether wearing a mask or whether wearing glasses; when the to-be-clustered images are license plate images, the attribute combinations may include a license plate shape, a license plate color or a license plate number. Specifically, the first image files can be created by a manual annotation, and each of the to-be-clustered images is identified by a program developer to divide the to-be-clustered images belonging to the same category into the same group; as a result, a plurality of first image files are obtained. It is to be understood that other means may also be adopted to obtain first image files, such as directly obtaining the first image files from an image database, or taking a classification result of other clustering models as the first image files.
Further, each of the first image files also has an attribute combination, and the number of to-be-clustered images corresponding to different attribute combinations in each of the first image files can be counted. An attribute combination with the largest number is regarded as the attribute combination of each of the first image files. For example, in the case where the to-be-clustered images are face images, for a first image file that has been annotated as the same person, an attribute combination of each face image in the file can be counted, and the attribute combination that appears the most times is regarded as the attribute combination of the file.
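A minimal sketch of taking the most frequent attribute combination in a file as the file's attribute combination; the per-image dict with an "attributes" tuple is an assumed, illustrative data layout:

```python
from collections import Counter

def file_attribute_combination(images):
    """Return the attribute combination that appears most often in a file.

    `images` is assumed to be a list of dicts, each carrying a hashable
    'attributes' tuple such as ("child", "female", "mask").
    """
    counts = Counter(img["attributes"] for img in images)
    combination, _ = counts.most_common(1)[0]
    return combination
```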
In an operation 12: similarities among the to-be-clustered images of each of the first image files are calculated, which are recorded as intra-class similarities, and an intra-class similarity distribution is constructed based on the intra-class similarities.
After the first image files are obtained, the to-be-clustered images in each of the first image files can be processed. For example, features of the to-be-clustered images may be extracted by means of a feature extraction method, and similarities between the features of any two to-be-clustered images are calculated, which are recorded as intra-class similarities (that is, in the file) . An intra-class similarity distribution is constructed with all the intra-class similarities.
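A sketch of computing the pairwise intra-class similarities of one file from extracted feature vectors; cosine similarity is an assumed choice of measure, as the disclosure does not fix one:

```python
import numpy as np

def intra_class_similarities(features):
    """Pairwise cosine similarities between all images of one first image file.

    `features` is an (M, d) array of feature vectors, one per to-be-clustered
    image; each unordered pair contributes one intra-class similarity.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    iu = np.triu_indices(len(features), k=1)  # upper triangle: each pair once
    return sim[iu]
```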
In an operation 13: similarities among to-be-clustered images with a same attribute combination in all of the first image files are calculated, which are recorded as inter-class similarities, and an inter-class similarity distribution is constructed based on the inter-class similarities.
After an attribute combination of each of the first image files is obtained, a merging process on all the first image files is performed based on attribute combinations, that is, at least two first image files with a same attribute combination are merged. A file generated after the merging process can be recorded as an attribute image file, and the to-be-clustered images in the attribute image file are processed. For example, features of to-be-clustered images are extracted, and similarities between the features of any two to-be-clustered images of the attribute image file are calculated, which are recorded as inter-class similarities (that is, among the files) . An inter-class similarity distribution is constructed with all the inter-class similarities.
Through the processing described above, the final result is that each attribute combination corresponds to one similarity distribution. For example, when there are k (k ≥ 1) attribute combinations D1, D2, D3, ..., Dk, accordingly there are k intra-class similarity distributions F1, F2, F3, ..., Fk. The F1, F2, F3, ..., Fk represent intra-class similarities under different attribute combinations respectively. Likewise, similarities among different files under a same attribute combination are calculated to obtain k inter-class similarity distributions F′1, F′2, F′3, ..., F′k.
In an embodiment, in order to construct an intra-class similarity distribution and an inter-class similarity distribution, a preset similarity range, which may be [0, 1] or [0%, 100%] , may be divided into a preset number of subranges. The number of intra-class similarities falling in each subrange and the number of inter-class similarities falling in each subrange may be counted respectively, which are recorded as a first number and a second number accordingly. The first number is divided by a total number of the intra-class similarities to obtain a probability value corresponding to a corresponding one of the intra-class similarities, and an intra-class similarity distribution may be constructed based on all of the intra-class similarities and the corresponding probability values. The second number is divided by a total number of the inter-class similarities to obtain a probability value corresponding to a corresponding one of the inter-class similarities, and an inter-class similarity distribution may be constructed based on all of the inter-class similarities and the corresponding probability values.
For example, when there are M1 images in a first image file, similarities between any two to-be-clustered images of the first image file may be calculated to obtain M2 intra-class similarities; the preset similarity range is [0, 1] , which is divided into 10 subranges: 0-0.1, 0.1-0.2, ..., 0.9-1. The probability value that each of the M2 intra-class similarities falls in each subrange is calculated. For instance, a probability value of falling in the interval (0.9, 1] may be 0.9 and a probability value of falling in the interval (0.8, 0.9] may be 0.2; thereby a corresponding relationship between the intra-class similarities and the probability values is obtained.
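A sketch of this subrange counting, turning a set of similarities into a discrete distribution over 10 subranges of [0, 1]; the histogram-based representation is one possible implementation of the construction described above:

```python
import numpy as np

def similarity_distribution(similarities, num_bins=10):
    """Build a discrete similarity distribution over a preset range.

    The range [0, 1] is split into `num_bins` subranges; the count in each
    subrange is divided by the total number of similarities to obtain the
    probability values of the distribution.
    """
    counts, edges = np.histogram(similarities, bins=num_bins, range=(0.0, 1.0))
    probabilities = counts / counts.sum()
    return probabilities, edges
```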
In an operation 14: based on the intra-class similarity distribution and the inter-class similarity distribution, a merging process is performed for the to-be-clustered images corresponding to the attribute combinations that satisfy a preset merging condition, to obtain at least one second image file.
The preset merging condition is a pre-determined condition for determining whether two attribute combinations can be merged. After the intra-class similarity distribution and the inter-class similarity distribution are obtained, the to-be-clustered images corresponding to attribute combinations with a similar or same similarity distribution can be merged together based on the intra-class similarity distribution and the inter-class similarity distribution, to obtain at least one second image file. It can be seen that the number of the second image files is less than the number of the first image files.
Understandably, when the similarity distributions corresponding to any two attribute combinations fail to satisfy the preset merging condition, there is no need to perform the merging process, and an operation 15 is executed directly.
In an operation 15: a clustering process on all of the second image files is performed to obtain a clustering result.
A clustering process is performed on at least one of the second image files by means of a clustering method so as to obtain a corresponding clustering result, and accordingly the clustering process is completed. Specifically, one clustering method may be adopted for clustering once, or multiple clustering methods may be adopted. For example, one clustering method (such as a hierarchical clustering method or a density clustering method) is adopted first, and the obtained clustering categories are regarded as the initial clustering categories of a second clustering pass; another clustering method (such as a k-means clustering method) is then adopted for the second clustering pass in the following processing.
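A minimal sketch of such a two-pass scheme, assuming scikit-learn is available and the features are held in an (N, d) NumPy array; the hierarchical first pass, the distance threshold and the function name two_pass_clustering are illustrative choices, not the disclosure's prescribed implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def two_pass_clustering(features, distance_threshold=1.0):
    # First pass: hierarchical clustering; the threshold is a placeholder value.
    first = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=distance_threshold).fit_predict(features)
    # Centroids of the first-pass categories initialise the second (k-means) pass.
    centers = np.stack([features[first == c].mean(axis=0) for c in np.unique(first)])
    second = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit_predict(features)
    return second
```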
It is understandable that the clustering method provided by the present embodiment can be applied to the technical field of face recognition, when the to-be-clustered images are face images.
The present embodiment provides a clustering method based on similarity distributions of attribute combinations of the to-be-clustered images. Similarity distributions of different attribute combinations are constructed first (that is, a mapping relationship between the attribute combinations and the similarity distributions is constructed first); the to-be-clustered images corresponding to attribute combinations with a similar or same similarity distribution are merged together to generate at least one second image file; subsequently, a clustering process is performed on all of the second image files to obtain a clustering result, such that offline image clustering is achieved. Since the attribute combination of an image is considered, the difference in similarity distributions caused by different attribute combinations may be smoothed; the problem that different attribute combinations result in a poor recall or accuracy rate may be effectively alleviated, and the accuracy and recall rate of clustering may be improved.
Referring to FIG. 2, FIG. 2 is a flow chart of a clustering method according to another embodiment of the present disclosure. The method includes following operations.
In an operation 201: a plurality of first image files are acquired.
In an operation 202: similarities among the to-be-clustered images of each of the first image files are calculated, which are recorded as intra-class similarities, and an intra-class similarity distribution is constructed based on the intra-class similarities.
In an operation 203: similarities among to-be-clustered images with a same attribute combination in all of the first image files are calculated, which are recorded as inter-class similarities, and an inter-class similarity distribution is constructed based on the inter-class similarities.
The operations 201 to 203 are the same as the operations 11 to 13 in the embodiment described above, and will not be repeated here.
In an operation 204: two attribute combinations are selected from all attribute combinations as a first attribute combination and a second attribute combination. The intra-class similarities and the inter-class similarities of the first attribute combination are obtained, and the intra-class similarities and the inter-class similarities of the second attribute combination are obtained.
Each attribute combination has a corresponding intra-class similarity distribution and a corresponding inter-class similarity distribution. Different attribute combinations may have similar similarity distributions. In order to determine whether the similarity distributions corresponding to the attribute combinations are similar, the similarity distributions of any two attribute combinations can be compared. Further, the first attribute combination and the second attribute combination may be selected in a random order, or a sequence identifier is created for each attribute combination, and the first attribute combination and the second attribute combination are selected in turn based on the sequence identifier.
In an operation 205: similarities between the intra-class similarities of the first attribute combination and those of the second attribute combination are calculated, to obtain a first distribution similarity; similarities between the inter-class similarities of the first attribute combination and those of the second attribute combination are calculated, to obtain a second distribution similarity.
After the similarity distributions of the first attribute combination and those of the second attribute combination are obtained, similarities of the intra-class similarity distribution and those of the inter-class similarity distribution can be compared respectively, that is, similarities between the intra-class similarities of the first attribute combination and those of the second attribute combination are calculated; similarities between the inter-class similarities of the first attribute combination and those of the second attribute combination are calculated.
In an operation 206: it is determined whether the first distribution similarity and the second distribution similarity satisfy a preset merging condition.
After the first distribution similarity and the second distribution similarity are calculated, a distribution similarity measurement method, such as a KL (Kullback-Leibler) divergence, an f-divergence or a Wasserstein distance, may be adopted to determine whether the two distribution similarities satisfy the preset merging condition.
In an embodiment, the KL divergence is used to measure the difference between two distributions, that is, the first distribution similarity includes a first intra-class divergence and a second intra-class divergence, and the second distribution similarity includes a first inter-class divergence and a second inter-class divergence. A KL divergence of the intra-class similarities of the first attribute combination relative to those of the second attribute combination is calculated first to obtain the first intra-class divergence. A KL divergence of the intra-class similarities of the second attribute combination relative to those of the first attribute combination is calculated to obtain the second intra-class divergence. A KL divergence of the inter-class similarities of the first attribute combination relative to those of the second attribute combination is calculated to obtain the first inter-class divergence. A KL divergence of the inter-class similarities of the second attribute combination relative to those of the first attribute combination is calculated to obtain the second inter-class divergence. In the following processing, it is determined whether the first intra-class divergence is less than a first preset value, whether the second intra-class divergence is less than the first preset value, whether the first inter-class divergence is less than a second preset value, and whether the second inter-class divergence is less than the second preset value; that is, the following formula may be used to determine whether two similarity distributions are merged:
D_KL(Fi||Fj) < threshold_KL, D_KL(Fj||Fi) < threshold_KL, D_KL(F′i||F′j) < threshold′_KL, and D_KL(F′j||F′i) < threshold′_KL.
In the formula, Fi represents the intra-class similarities of the first attribute combination, and Fj represents the intra-class similarities of the second attribute combination; F′i represents the inter-class similarities of the first attribute combination, and F′j represents the inter-class similarities of the second attribute combination; 1≤i, j≤k, k represents the total number of attribute combinations corresponding to all the first image files, and i≠j; threshold_KL represents the first preset value and threshold′_KL represents the second preset value, both of which are predetermined thresholds. D_KL(Fi||Fj) represents the first intra-class divergence; D_KL(Fj||Fi) represents the second intra-class divergence; D_KL(F′i||F′j) represents the first inter-class divergence; D_KL(F′j||F′i) represents the second inter-class divergence.
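A small sketch of this merging check over two binned distributions is given below; the smoothing constant and the threshold values are placeholders, since the disclosure only states that the preset values are predetermined.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(p || q) between two binned distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def can_merge(Fi, Fj, Fpi, Fpj, thr_intra=0.1, thr_inter=0.1):
    """Preset merging condition for attribute combinations i and j (thresholds are placeholders)."""
    return (kl(Fi, Fj) < thr_intra and kl(Fj, Fi) < thr_intra
            and kl(Fpi, Fpj) < thr_inter and kl(Fpj, Fpi) < thr_inter)
```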
The operations 201-206 in the above are performed repeatedly until all of the attribute combinations are traversed.
In an operation 207: when the preset merging condition is satisfied, the to-be-clustered images corresponding to the first attribute combination and those corresponding to the second attribute combination are merged to obtain the second image file.
When the similarity distributions of the first attribute combination and those of the second attribute combination satisfy the preset merging condition, Fi and Fj are merged into the same one distribution, and F′i and F′j are merged into the same one distribution. After a merging process based on the KL divergence, for the k attribute combinations D1, D2, D3, ..., Dk there are correspondingly d intra-class similarity distributions F1, F2, F3, ..., Fd and d inter-class similarity distributions F′1, F′2, F′3, ..., F′d; d is less than k, that is, there are multiple attribute combinations corresponding to the same one similarity distribution.
After the merging process based on the similarities of attribute combinations, the first clustering pass can be performed to determine whether there are group noises in a file. The second clustering pass can be performed to filter the group noises. Subsequently, an abnormal noise detection can be performed to filter single point noises, specifically as shown in operations 208-212.
In an operation 208: the second image file is regarded as a current to-be-clustered file.
In an operation 209: a first preset clustering algorithm is adopted to perform a clustering process on the current to-be-clustered file to obtain a plurality of third image files.
The first preset clustering algorithm may be adopted for the first clustering pass. The first preset clustering algorithm may be a density clustering algorithm, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Incremental Grid Density-Based Clustering Algorithm (IGDCA), Ordering Points To Identify the Clustering Structure (OPTICS), the Largest Set of Non-Cored Core Points (LSNCCP), and so on. In the present example, the DBSCAN density clustering algorithm is adopted to generate the files of the first clustering pass.
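A possible first-pass step with scikit-learn's DBSCAN might look like the sketch below; the eps, min_samples and cosine-metric choices are assumptions, and image_ids is a hypothetical list of identifiers used only for illustration.

```python
from collections import defaultdict
from sklearn.cluster import DBSCAN

def first_clustering(features, image_ids, eps=0.3, min_samples=5):
    """Split the current to-be-clustered file into third image files with DBSCAN."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric='cosine').fit_predict(features)
    files = defaultdict(list)
    for img_id, label in zip(image_ids, labels):
        if label != -1:                      # -1 is DBSCAN's own noise label
            files[label].append(img_id)
    return list(files.values())              # each list is one third image file
```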
In an operation 210: it is determined whether a preset clustering termination condition is met, based on the intra-class similarity distribution of each of the third image files and the inter-class similarity distribution of each of the third image files.
When the intra-class similarity distribution of each of the third image files and the inter-class similarity distribution of each of the third image files satisfy the preset clustering termination condition, it is indicated that the clustering is over; in this case, a filtering process on the single point noise can be performed, that is, the operation 212 can be performed.
In an embodiment, as shown in FIG. 3, it is determined whether the preset clustering termination condition is met by the following operations.
In an operation 31: an attribute combination of a third image file is obtained, which is recorded as a current attribute combination.
For each of the third image files, the number of to-be-clustered images corresponding to each attribute combination in the third image file is counted. For example, when the number of to-be-clustered images in a third image file is N and there are k attribute combinations D1, D2, D3, ..., Dk, the numbers of images corresponding to the attribute combinations are N1, N2, N3, ..., Nk respectively. When Ni is the maximum value among N1, N2, N3, ..., Nk, Di is regarded as the attribute combination of the third image file.
In an operation 32: the intra-class similarity distribution and the inter-class similarity distribution of the current attribute combination are obtained.
The intra-class similarity distribution Fi and the inter-class similarity distribution F′i of the current attribute combination are obtained.
In an operation 33: an intra-class probability and an inter-class probability are calculated based on the intra-class similarity distribution and the inter-class similarity distribution of the current attribute combination.
The intra-class similarity distribution is constructed based on the intra-class similarities of the attribute combination and the corresponding probability values. The probability values corresponding to all of the intra-class similarities of the current attribute combination are multiplied to obtain the intra-class probability. The inter-class similarity distribution is constructed based on the inter-class similarities of the attribute combination and the corresponding probability values. The probability values corresponding to all of the inter-class similarities of the current attribute combination are multiplied to obtain the inter-class probability.
Further, for any similarity p, the probabilities that the similarity corresponds to in the intra-class similarity distribution and in the inter-class similarity distribution can be represented as Fi(p) and F′i(p) respectively. Since the number of images in a file is N, there are M = N(N-1)/2 pairs of to-be-clustered images, which means there are M similarities. When the similarities are represented by p1, p2, p3, ..., pM, an occurrence probability of the set of similarities can be calculated based on the occurrence probability of a single similarity; that is, a probability that all of the to-be-clustered images in the file belong to a same clustering target (that is, the intra-class probability) is represented by P = Fi(p1)·Fi(p2)·...·Fi(pM), while a probability that all of the to-be-clustered images in the file belong to different clustering targets (that is, the inter-class probability) is represented by P′ = F′i(p1)·F′i(p2)·...·F′i(pM). The smaller the intra-class probability P is, the greater the probability that group noises exist in the file; the greater the inter-class probability P′ is, the greater the possibility that group noises exist in the file.
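The probability computation can be sketched as below, assuming F_intra and F_inter are the binned probability arrays of the current attribute combination and edges are the shared bin boundaries; the log-space accumulation is an implementation convenience to avoid numerical underflow and is not mandated by the disclosure.

```python
import numpy as np

def file_probabilities(similarities, F_intra, F_inter, edges):
    """Intra-class probability P and inter-class probability P' of one file.

    `F_intra` / `F_inter` are NumPy arrays of per-subrange probability values,
    `edges` are the bin boundaries shared by both distributions.
    """
    bins = np.clip(np.digitize(similarities, edges) - 1, 0, len(F_intra) - 1)
    eps = 1e-12
    log_p = np.sum(np.log(F_intra[bins] + eps))    # log of the product of Fi(p_j)
    log_pp = np.sum(np.log(F_inter[bins] + eps))   # log of the product of F'i(p_j)
    return np.exp(log_p), np.exp(log_pp)
```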
In an operation 34: it is determined whether the intra-class probability and the inter-class probability satisfy the preset clustering termination condition.
It is determined whether the intra-class probability is greater than or equal to a first preset probability threshold and whether the inter-class probability is less than or equal to a second preset probability threshold. The first preset probability threshold and the second preset probability threshold are predetermined probability thresholds. It is determined that there are group noises in the current to-be-clustered file when the intra-class probability is less than the first preset probability threshold or the inter-class probability is greater than the second preset probability threshold. The second clustering pass is required for a file with group noises.
In an embodiment, a second preset clustering algorithm is the K-means clustering algorithm. When there are group noises in a third image file, a clustering process is performed on the third image file by the K-means clustering algorithm. When there are no group noises in the third image file, considering that a single point noise may still exist, it may be required to first determine whether there is a single point noise existing in the third image file. In case that there is a single point noise existing in the third image file, a filtering process is performed on the single point noise. Understandably, the single point noise may also be filtered directly without a detection, that is, the operation 212 is performed.
In an operation 211: when the preset clustering termination condition is not met, a clustering process is performed on each of the third image files by a second preset clustering algorithm to obtain a plurality of fourth image files, each of which is regarded as the current to-be-clustered file.
When the preset clustering termination condition is not met, it is indicated that the possibility that there are group noises in a third image file is large. Given that the file is already a result of one pass of density clustering, the number of group noises tends not to be large, at which time the second preset clustering algorithm can be adopted to perform a clustering process on the third image files, to obtain a plurality of fourth image files. Each of the fourth image files is regarded as a current to-be-clustered file, and the first preset clustering algorithm is adopted to perform a clustering process on the current to-be-clustered files, that is, the flow returns to the operation 209. Taking face images as an example, there may be a "multiple people in one file" situation in a file after the density clustering is performed once. Considering the probability of a false positive, the number of different people in one file generally does not exceed three.
Considering the above situation, the K-means clustering algorithm is adopted for the second filtering pass. Further, in view of the number of group noises, the parameter k of the K-means is set to 2 (that is, the number of the fourth image files is 2), so as to ensure that two purer files are generated after the K-means clustering. A file A of face images is taken as an example in the following for explanation.
The file A generates a file A1 and a file A2 after the K-means clustering. A probability that the to-be-clustered images in the file A correspond to the same one person is PA, and a probability that the to-be-clustered images in the file A correspond to different people is P′A. When the probabilities that the to-be-clustered images in the file A1 correspond to the same person and to different people are PA1 and P′A1 respectively, and the probabilities that the to-be-clustered images in the file A2 correspond to the same person and to different people are PA2 and P′A2 respectively, it is obvious that PA < PA1 and PA < PA2, which may ensure the files become purer after the K-means clustering but cannot guarantee that the intra-class probability and the inter-class probability of the files A1 and A2 satisfy the preset clustering termination condition described above.
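A minimal sketch of the k = 2 split with scikit-learn's KMeans, assuming the features of the suspect file are available as an array; image_ids and the function name split_file are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_file(features, image_ids):
    """Split a file suspected of group noises into two purer files (k = 2)."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.asarray(features))
    file_a1 = [i for i, label in zip(image_ids, labels) if label == 0]
    file_a2 = [i for i, label in zip(image_ids, labels) if label == 1]
    return file_a1, file_a2
```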
An operation 212 is ready to be performed when the intra-class probability and the inter-class probability of the file A1 satisfy the preset clustering termination condition. When the preset clustering termination condition fails to be satisfied, the flow returns to the operation 209 again to perform an iteration process. Similarly, the file A2 is processed in the same way. In this way, it may be ensured that all files are relatively pure when the operation 212 is going to be performed, and the possibility that there are group noises existing in the files is already quite small.
In an operation 212: a filtering process of a single point noise is performed.
When the filtering operation of the single point noise is ready to be performed, it means that all files are high-probability files, but there is still a possibility that single point noises exist in the files. For face images, a single point noise may be caused by multiple problems such as a face angle, face occlusion, an image pixel or image quality. With respect to a single point noise, the single point noise is filtered based on the following two solutions in the present embodiment.
1) An intra-class similarity distribution of a third image file is calculated, which is recorded as a first intra-class similarity distribution. Similarities between each of the to-be-clustered images of the third image file and the other images of the third image file are calculated, and an intra-class similarity distribution is calculated based on the similarities, which is recorded as a second intra-class similarity distribution. Similarities between the first intra-class similarity distribution and the second intra-class similarity distribution are calculated, which are recorded as distribution similarities. A to-be-clustered image corresponding to a maximum value of the distribution similarities is regarded as a single point noise, and the single point noise is deleted from the third image file.
Since a single point noise is essentially an outlier in the file, the similarities of the single point noise present a distribution different from that of the similarities of the other points in the file. For example, when the intra-class similarity distribution of the file is F and there are N to-be-clustered images in the file, the similarities between each of the to-be-clustered images and the other to-be-clustered images are calculated, and (N-1) similarities can be obtained. The (N-1) similarities constitute one intra-class similarity distribution. With a similar operation repeated, one intra-class similarity distribution of each of the to-be-clustered images can be obtained, and finally the intra-class similarity distributions F1, F2, F3, ..., FN of all the to-be-clustered images are obtained.
For each of the intra-class similarity distributions, the difference between the intra-class similarity distribution and the whole intra-class similarity distribution F of the file is calculated. For example, the KL divergence is used to measure the difference between the intra-class similarity distributions, and the to-be-clustered image with the largest intra-class similarity distribution difference is filtered out as a single point noise.
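A sketch of this first solution, assuming the pairwise similarity matrix of the file is available as an (N, N) NumPy array; the bin count and the smoothing constant are placeholders.

```python
import numpy as np

def single_point_noise(sim_matrix, n_bins=10, value_range=(0.0, 1.0)):
    """Index of the image whose similarity distribution differs most from the file's."""
    n = len(sim_matrix)
    iu = np.triu_indices(n, k=1)
    F, _ = np.histogram(sim_matrix[iu], bins=n_bins, range=value_range)      # whole-file distribution
    F = (F + 1e-12) / (F.sum() + n_bins * 1e-12)

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    divergences = []
    for i in range(n):
        sims_i = np.delete(sim_matrix[i], i)          # similarities of image i to the other N-1 images
        Fi, _ = np.histogram(sims_i, bins=n_bins, range=value_range)
        Fi = (Fi + 1e-12) / (Fi.sum() + n_bins * 1e-12)
        divergences.append(kl(Fi, F))
    return int(np.argmax(divergences))                # candidate single point noise
```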
2) An abnormal value detection (also known as outlier detection) algorithm is adopted to perform a filtering process on the third image file to filter out the single point noise in the third image file.
Outlier detection is a detection process for finding objects whose behaviors are quite different from the expected ones. The objects are referred to as abnormal points or outliers. Outlier detection algorithms have specific applications in real life, such as credit card fraud detection, industrial loss detection or image detection. An outlier detection algorithm can be adopted to detect the single point noise in image clustering; for example, an isolation forest algorithm or a random cut forest (RCF) algorithm may be adopted to filter the single point noise.
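For this second solution, an isolation-forest based filter could be sketched as follows with scikit-learn; the contamination ratio is a placeholder assumption, since the disclosure does not specify the expected proportion of single point noises, and image_ids is a hypothetical list of identifiers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def filter_single_point_noise(features, image_ids, contamination=0.05):
    """Drop outlier images from a third image file with an isolation forest."""
    labels = IsolationForest(contamination=contamination,
                             random_state=0).fit_predict(np.asarray(features))
    kept = [i for i, label in zip(image_ids, labels) if label == 1]    # 1 = inlier
    noise = [i for i, label in zip(image_ids, labels) if label == -1]  # -1 = outlier
    return kept, noise
```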
Understandably, a new file can be created and the single point noise detected is put into the file for subsequent use.
In the present embodiment, the intra-class probability and the inter-class probability of a file are determined by virtue of the probabilities of the similarities, and the possibility that a noise exists in the file may be further determined. Regarding group noises, a manner of combining the DBSCAN density clustering with the K-means clustering is adopted. The high recall rate of DBSCAN and the iterative K-means clustering may ensure the filtering of group noises, such that the group noises of the file are filtered out as much as possible. After the first filtering pass, the similarity distribution difference and the outlier detection algorithm are introduced to detect the single point noise to further obtain a purer file, and the problem that multiple clustering targets exist in one file may be alleviated effectively.
To sum up, in order to solve the problem that there are differences in similarity distributions between different attribute combinations, the present disclosure provides a clustering method. Intra-class similarity distributions and inter-class similarity distributions of different attribute combinations are constructed first based on the annotated files. The similarity distributions of different attribute combinations are classified by virtue of the KL divergence. When clustering is performed on a file, the first clustering pass is completed by the density clustering algorithm; subsequently the attribute combinations of all images in the file are counted to determine a whole attribute combination of the file. Based on the similarity distributions corresponding to the attribute combinations and the similarities in the file, a probability that all images of the file correspond to the same person and a probability that all images of the file do not correspond to the same person are obtained. The K-means clustering is performed on the files whose probabilities fail to satisfy the preset clustering termination condition, such that group noises may be effectively filtered out. With each of the files as a unit, the single point noise in each of the files can be filtered based on the similarity distribution difference of the single point noise or the outlier detection algorithm. With the similarity distributions of different attribute combinations combined and two-layer noise filtering adopted, the problem of "multiple people in one file" may be effectively solved and the problem of "one person corresponding to multiple files" may be alleviated to a certain extent.
Referring to FIG. 4, FIG. 4 is a structural schematic view of a clustering apparatus according to an embodiment of the present disclosure. The clustering apparatus 40 includes a memory 41 and a processor 42 connected to the memory 41. The memory 41 is configured to store a computer program. The computer program is configured to implement the clustering method in the above embodiment when being executed by the processor 42.
Referring to FIG. 5, FIG. 5 is a structural schematic view of a computer-readable non-transitory storage medium according to an embodiment of the present disclosure. The computer-readable non-transitory storage medium 50 is configured to store a computer program 51. The computer program 51 is configured to implement the clustering method in the above embodiment when being executed.
The computer-readable non-transitory storage medium 50 may be a server, a universal serial bus disk, a mobile hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media that can store program codes.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed methods and the apparatus can be implemented by other means. For example, the apparatus in the above described embodiments is merely exemplary. For example, modules or units are divided based on logical functions only but can be divided in another way practically. For example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
The units illustrated as separated components may or may not be physically separated. A component displayed as a unit may or may not be a physical unit. That is, the component may be located in one place or distributed to multiple network units. A part of or the entire unit can be selected according to practical needs to achieve the purpose of the present disclosure.
In addition, each functional unit in each implementation of the present disclosure can be integrated in one single processing unit or physically present separately. Alternatively, two or more units can be integrated in one single unit. The above integrated unit can be implemented either in a form of hardware or in a form of software functional units.
The above shows only an example of the present disclosure, but does not limit the scope of the present disclosure. Any equivalent structure or equivalent process transformation based on the specification and the accompanying drawings of the present disclosure, directly or indirectly applied in other related fields, shall be included in the scope of the present disclosure.
Claims (16)
- A clustering method, comprising: acquiring a plurality of first image files, each of the first image files comprises a plurality of to-be-clustered images, each of the to-be-clustered images is with an attribute combination; calculating similarities among the to-be-clustered images of each of the first image files, recording the similarities as intra-class similarities, and constructing an intra-class similarity distribution based on the intra-class similarities; calculating similarities among to-be-clustered images with a same attribute combination in all of the first image files, recording the similarities as inter-class similarities, and constructing an inter-class similarity distribution based on the inter-class similarities; merging to-be-clustered images corresponding to attribute combinations satisfying a preset merging condition based on the intra-class similarity distribution and the inter-class similarity distribution, to obtain at least one second image file; and performing a clustering process on all of the at least one second image file, to obtain a clustering result.
- The clustering method according to claim 1, wherein the performing a clustering process on all the second image files to obtain a clustering result, comprises: taking the second image file as a current to-be-clustered file; performing a clustering process on the current to-be-clustered file by a first preset clustering algorithm, to obtain a plurality of third image files; determining whether a preset clustering termination condition is met, based on intra-class similarity distribution of each of the third image files and inter-class similarity distribution of each of the third image files; and in response to a result of the above determining is no, performing a clustering process on each of the third image files by a second preset clustering algorithm to obtain a plurality of fourth image files, taking each of the fourth image files as the current to-be-clustered file, and returning to the operation of performing a clustering process on the current to-be-clustered file by a first preset clustering algorithm.
- The clustering method according to claim 2, comprising: obtaining an attribute combination of a third image file, and recording the attribute combination as a current attribute combination; acquiring the intra-class similarity distribution and the inter-class similarity distribution of the current attribute combination; calculating an intra-class probability and an inter-class probability based on the intra-class similarity distribution of the current attribute combination and the inter-class similarity distribution of the current attribute combination; and determining whether the intra-class probability and the inter-class probability satisfy the preset clustering termination condition.
- The clustering method according to claim 3, wherein the intra-class similarity distribution is constructed based on intra-class similarities of the attribute combination and corresponding probabilities, the inter-class similarity distribution is constructed based on inter-class similarities of the attribute combination and corresponding probabilities, the calculating an intra-class probability and an inter-class probability based on the intra-class similarity distribution of the current attribute combination and the inter-class similarity distribution of the current attribute combination, comprises: multiplying probability values corresponding to all of the intra-class similarities of the current attribute combination, to obtain the intra-class probability; and multiplying probability values corresponding to all of the inter-class similarities of the current attribute combination, to obtain the inter-class probability.
- The clustering method according to claim 4, further comprising: dividing a preset similarity range into a preset number of subranges; counting the number of the intra-class similarities falling in each of the subranges, which is recorded as a first number; counting the number of the inter-class similarities falling in each of the subranges, which is recorded as a second number; dividing the first number by a total number of the intra-class similarities to obtain a probability value corresponding to a corresponding one of the intra-class similarities, and constructing the intra-class similarity distribution based on all of the intra-class similarities and the corresponding probability values; and dividing the second number by a total number of the inter-class similarities to obtain a probability value corresponding to a corresponding one of the inter-class similarities, and constructing the inter-class similarity distribution based on all of the inter-class similarities and the corresponding probability values.
- The clustering method according to claim 3, wherein the determining whether the intra-class probability and the inter-class probability meet the preset clustering termination condition, comprises: determining whether the intra-class probability is greater than or equal to a first preset probability threshold, and whether the inter-class probability is less than or equal to a second preset probability threshold; and in response to a result of the above determining is no, determining that the current to-be-clustered file has group noises existing therein.
- The clustering method according to claim 6, wherein the first preset clustering algorithm is a density clustering algorithm, the second preset clustering algorithm is a k-means clustering algorithm, and the clustering method further comprises: in response to the third image file having the group noises existing therein, performing the clustering process on the third image file by adopting the K-means clustering algorithm; and in response to the third image file not having the group noises existing therein, performing a filtering process of a single point noise.
- The clustering method according to claim 7, wherein the performing a filtering process of a single point noise, comprises: calculating an intra-class similarity distribution of the third image file and recording the intra-class similarity distribution as a first intra-class similarity distribution; calculating similarities between each of the to-be-clustered images of the third image file and other images of the third image file, calculating an intra-class similarity distribution based on the similarities, and recording the intra-class similarity distribution as a second intra-class similarity distribution; calculating similarities between the first intra-class similarity distribution and the second intra-class similarity distribution, and recording the similarities as distribution similarities; and taking a to-be-clustered image corresponding to a maximum value of the distribution similarities as a single point noise, and deleting the single point noise from the third image file.
- The clustering method according to claim 7, wherein the performing a filtering process of a single point noise, further comprises: performing a filtering process on the third image file by adopting an outlier detection algorithm, to filter out the single point noise in the third image file.
- The clustering method according to claim 1, wherein the merging to-be-clustered images corresponding to attribute combinations satisfying a preset merging condition based on the intra-class similarity distribution and the inter-class similarity distribution, to obtain at least one second image file, comprises: selecting two attribute combinations from all attribute combinations as a first attribute combination and a second attribute combination; obtaining intra-class similarities and inter-class similarities of the first attribute combination, and obtaining intra-class similarities and inter-class similarities of the second attribute combination; calculating similarities between the intra-class similarities of the first attribute combination and those of the second attribute combination, to obtain a first distribution similarity; calculating similarities between the inter-class similarities of the first attribute combination and those of the second attribute combination to obtain a second distribution similarity; determining whether the first distribution similarity and the second distribution similarity satisfy a preset merging condition; in response to the above determining is yes, merging the to-be-clustered images corresponding to the first attribute combination and those corresponding to the second attribute combination, to obtain the second image file; and performing the above operations repeatedly until all of the attribute combinations are traversed.
- The clustering method according to claim 10, wherein the first distribution similarity includes a first intra-class divergence and a second intra-class divergence, the second distribution similarity includes a first inter-class divergence and a second inter-class divergence, and the clustering method further comprises: calculating a KL divergence of the intra-class similarities of the first attribute combination relative to those of the second attribute combination to obtain the first intra-class divergence; calculating a KL divergence of the intra-class similarities of the second attribute combination relative to those of the first attribute combination to obtain the second intra-class divergence; calculating a KL divergence of the inter-class similarities of the first attribute combination relative to those of the second attribute combination to obtain the first inter-class divergence; and calculating a KL divergence of the inter-class similarities of the second attribute combination relative to those of the first attribute combination to obtain the second inter-class divergence.
- The clustering method according to claim 11, wherein the determining whether the first distribution similarity and the second distribution similarity meet a preset merging condition, comprises: determining whether the first intra-class divergence is less than a first preset value, whether the second intra-class divergence is less than the first preset value, whether the first inter-class divergence is less than a second preset value, and whether the second inter-class divergence is less than the second preset value.
- The clustering method according to claim 1, further comprising: counting the number of the to-be-clustered images corresponding to different attribute combinations in each of the first image files; and taking an attribute combination of the to-be-clustered images with a largest number, as the attribute combination of each of the first image files.
- The clustering method according to claim 1, wherein the to-be-clustered images are face images, and the attribute combinations comprise an attribute combination of race, an attribute combination of age, an attribute combination of gender, an attribute combination of whether wearing a mask, or an attribute combination of whether wearing glasses.
- A clustering apparatus, comprising a memory and a processor connected to the memory, wherein the memory is configured to store a computer program, and the computer program is configured to perform the clustering method according to any one of claims 1-14 when being executed by the processor.
- A computer-readable non-transitory storage medium, configured to store a computer program, wherein the computer program is configured to perform the clustering method according to any one of claims 1-14 when being executed.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110747026.2 | 2021-07-02 | ||
CN202110747026.2A CN113255841B (en) | 2021-07-02 | 2021-07-02 | Clustering method, clustering device and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023273081A1 true WO2023273081A1 (en) | 2023-01-05 |
Family
ID=77190445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/128674 WO2023273081A1 (en) | 2021-07-02 | 2021-11-04 | Clustering method, clustering apparatus, and non-transitory computer-readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113255841B (en) |
WO (1) | WO2023273081A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255841B (en) * | 2021-07-02 | 2021-11-16 | 浙江大华技术股份有限公司 | Clustering method, clustering device and computer readable storage medium |
CN113705650B (en) * | 2021-08-20 | 2023-07-11 | 网易(杭州)网络有限公司 | Face picture set processing method, device, medium and computing equipment |
CN113762376A (en) * | 2021-08-31 | 2021-12-07 | 阿里巴巴新加坡控股有限公司 | Image clustering method and device, electronic equipment and storage medium |
CN114463572B (en) * | 2022-03-01 | 2023-06-09 | 智慧足迹数据科技有限公司 | Regional clustering method and related device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120321193A1 (en) * | 2010-12-30 | 2012-12-20 | Nokia Corporation | Method, apparatus, and computer program product for image clustering |
CN103914518A (en) * | 2014-03-14 | 2014-07-09 | 小米科技有限责任公司 | Clustering method and clustering device |
CN108229419A (en) * | 2018-01-22 | 2018-06-29 | 百度在线网络技术(北京)有限公司 | For clustering the method and apparatus of image |
CN109815788A (en) * | 2018-12-11 | 2019-05-28 | 平安科技(深圳)有限公司 | A kind of picture clustering method, device, storage medium and terminal device |
CN113255841A (en) * | 2021-07-02 | 2021-08-13 | 浙江大华技术股份有限公司 | Clustering method, clustering device and computer readable storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109117803B (en) * | 2018-08-21 | 2021-08-24 | 腾讯科技(深圳)有限公司 | Face image clustering method and device, server and storage medium |
CN110569918B (en) * | 2019-09-12 | 2024-08-09 | 腾讯科技(深圳)有限公司 | Sample classification method and related device |
CN112650873A (en) * | 2020-12-18 | 2021-04-13 | 新疆爱华盈通信息技术有限公司 | Method and system for realizing intelligent photo album, electronic device and storage medium |
- 2021
- 2021-07-02 CN CN202110747026.2A patent/CN113255841B/en active Active
- 2021-11-04 WO PCT/CN2021/128674 patent/WO2023273081A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN113255841B (en) | 2021-11-16 |
CN113255841A (en) | 2021-08-13 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21948007; Country of ref document: EP; Kind code of ref document: A1