Detailed Description
Embodiments of the present disclosure are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present disclosure and are not to be construed as limiting the present disclosure.
Files stored on a server (e.g., a cloud disk) take various forms, such as text files, picture files, audio files, and video files, and come in different formats. Traditional personalized file search methods mainly include the following:
(1) Searching the file according to the search statement of the user;
(2) Recommending related files according to social network information of users, such as friends, groups and the like;
(3) Performing preprocessing operations such as word segmentation on the file text, obtaining word vector representations with a classical word vector model such as Word2Vec, obtaining the file vector representation by weighted summation, and performing search recommendation based on vector similarity.
Scheme (1) is simple and easy to implement, but cannot reflect the interests of users. Scheme (2) requires users to actively maintain their own social network information and tends to narrow the file search range, which reduces the diversity of file recommendations. In scheme (3), Word2Vec has a limited grasp of complex semantic relations and context information and therefore limited precision; it mainly targets the single modality of text and easily ignores the deep semantic information of long text files.
In view of the foregoing, the present disclosure proposes a search method, apparatus, electronic device, and storage medium.
The following describes a search method, apparatus, electronic device, and storage medium of the embodiments of the present disclosure with reference to the accompanying drawings.
Fig. 1 is a flowchart of a search method according to an embodiment of the disclosure.
As shown in fig. 1, the search method may include the steps of:
Step 101, in response to receiving a file search request sent by a client, for any candidate file among a plurality of candidate files associated with the client, acquiring a file name and a file title of the candidate file, and processing text information of the candidate file according to its file type to obtain a text slice set.
As one example, files uploaded by the client, files restored by the client, and the like may be regarded as the plurality of candidate files associated with the client.
It may be understood that the plurality of candidate files associated with the client may include a plurality of file types, for example, text, pictures, audio, and the like. Therefore, in order to accurately represent file semantics, the file name and the file title of any candidate file may be obtained, and the text information of the candidate file may be extracted according to its file type. The text information may then be segmented according to its length information to obtain a text slice set corresponding to the text information of the candidate file, where the text slice set includes a plurality of text slices.
It should be noted that, because the plurality of candidate files may include a plurality of file types, candidate files of some file types (e.g., pictures) may not include a file title. For a candidate file that includes a file title, the file title may be extracted directly; for a candidate file that does not include a file title, the text information of the candidate file may be obtained first, and the file title may then be extracted from that text information.
Step 102, generating a first slice vector set according to the text slice set, the file name, and the file title.
In order to further improve the accuracy of representing file semantics, each text slice in the text slice set of any candidate file, the file name of the candidate file, and the file title of the candidate file may be vectorized to obtain an initial slice vector set, a file name vector, and a file title vector of the candidate file. The initial slice vector set, the file name vector, and the file title vector are then fused to obtain the first slice vector set of the candidate file.
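As a non-limiting illustration of the fusion described above, the following minimal sketch blends each initial slice vector with the file name vector and the file title vector. The disclosure does not specify the fusion rule; the weighted sum, the function name `fuse_slice_vectors`, and the weights `w_slice`, `w_name`, and `w_title` are assumptions made only for this example.

```python
from typing import List

Vector = List[float]


def fuse_slice_vectors(
    initial_slices: List[Vector],
    name_vec: Vector,
    title_vec: Vector,
    w_slice: float = 0.6,  # hypothetical weight, not taken from the disclosure
    w_name: float = 0.2,   # hypothetical weight
    w_title: float = 0.2,  # hypothetical weight
) -> List[Vector]:
    """Blend every initial slice vector with the file name and title vectors."""
    fused = []
    for slice_vec in initial_slices:
        fused.append([
            w_slice * s + w_name * n + w_title * t
            for s, n, t in zip(slice_vec, name_vec, title_vec)
        ])
    return fused


# One fused vector is produced per text slice.
first_slice_vector_set = fuse_slice_vectors(
    [[1.0, 0.0], [0.5, 0.5]], name_vec=[0.0, 1.0], title_vec=[1.0, 1.0]
)
```

Under this assumed rule every slice carries some file-level context, so a slice can still match a query that mentions only the file name or title.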
Step 103, obtaining a preference vector of a target object logged in to the client and a feature vector of a search keyword in a search statement in the search request, and performing weighted fusion on the preference vector and the feature vector to obtain a session vector.
In order to support personalized search, the preference of the target object logged in to the client and the search keyword may be combined for searching, on the basis of accurately representing file semantics.
The preference vector of the target object may be obtained from the behavior sequence of the target object on the plurality of candidate files within a set period, and may be used to indicate the degree of preference of the target object for each candidate file. The feature vector of the search keyword may be obtained by vectorizing the search keyword in the search statement.
Step 104, determining at least one target file from the candidate files according to a first similarity between the session vector and each slice vector in the first slice vector set of each candidate file, and sending each target file to the client.
As an example, a first similarity between the session vector and each slice vector in the first slice vector set of each candidate file may be calculated, at least one target file may be determined from the candidate files based on the first similarity corresponding to each slice vector in the first slice vector set of each candidate file, and each target file may be sent to the client.
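As a non-limiting illustration of this first example, the sketch below scores each candidate file by the similarity between the session vector and its slice vectors. Cosine similarity, scoring a file by its best-matching slice, and the names `cosine` and `select_target_files` are assumptions; the disclosure does not fix the similarity measure or the selection rule.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_target_files(
    session_vec: Vector,
    slice_sets: Dict[str, List[Vector]],
    top_n: int = 2,
) -> List[str]:
    """Score each file by its best-matching slice (assumed rule), keep top_n."""
    scores = {
        file_id: max((cosine(session_vec, s) for s in slices), default=0.0)
        for file_id, slices in slice_sets.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```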
As another example, to reduce computational complexity, for any candidate file, feature dimension reduction is performed on the plurality of slice vectors in the first slice vector set of the candidate file to obtain a second slice vector set, and feature dimension reduction is performed on the session vector according to the plurality of slice vectors in the first slice vector set of the candidate file to obtain a dimension-reduced session vector. The first similarity corresponding to each slice vector in the first slice vector set of the candidate file is then determined according to a second similarity between each slice vector in the second slice vector set and the dimension-reduced session vector. Finally, at least one target file is determined from the candidate files according to these first similarities, and each target file is sent to the client.
In summary, in response to receiving a file search request sent by a client, for any candidate file among a plurality of candidate files associated with the client, a file name and a file title of the candidate file are acquired, and text information of the candidate file is processed according to its file type to obtain a text slice set; a first slice vector set is generated according to the text slice set, the file name, and the file title; a preference vector of a target object logged in to the client and a feature vector of a search keyword in a search statement in the search request are obtained and fused to obtain a session vector; and at least one target file is determined from the candidate files according to the first similarity between the session vector and each slice vector in the first slice vector set of each candidate file, and each target file is sent to the client. In this way, the candidate files are vectorized at the slice level based on their file types, taking the file name and the file title into account, and the target files are determined from the plurality of candidate files using the session vector, which combines the preference vector of the target object with the feature vector of the search keyword. The preference of the user and the search keyword are thus considered on the basis of accurately representing the semantics of the candidate files, so that the files required by the user are accurately recalled from the candidate files, the personalized search requirements of different users are met, and the search experience of the user is improved.
To clearly illustrate how in the above embodiments at least one target file is determined from each candidate file according to the session vector and the first similarity between each slice vector in the first set of slice vectors of each candidate file, and each target file is sent to the client, the present disclosure proposes another search method.
Fig. 2 is a flowchart of another search method according to an embodiment of the disclosure.
As shown in fig. 2, the search method may include the steps of:
Step 201, in response to receiving a file search request sent by a client, for any candidate file among a plurality of candidate files associated with the client, acquiring a file name and a file title of the candidate file, and processing text information of the candidate file according to its file type to obtain a text slice set.
Step 202, generating a first slice vector set according to the text slice set, the file name, and the file title.
Step 203, obtaining a preference vector of the target object of the login client and a feature vector of the search keyword in the search statement in the search request, and fusing the preference vector and the feature vector to obtain a session vector.
Step 204, for any candidate file, performing feature dimension reduction on the plurality of slice vectors in the first slice vector set of the candidate file to obtain a second slice vector set, and performing feature dimension reduction on the session vector according to the plurality of slice vectors in the first slice vector set of the candidate file to obtain a dimension-reduced session vector.
In order to reduce the computational complexity, feature dimension reduction may be performed on both the plurality of slice vectors in the first slice vector set of any candidate file and the session vector, to obtain the second slice vector set and the dimension-reduced session vector.
As an example, a trained feature dimension reduction model may be used to perform feature dimension reduction on the plurality of slice vectors in the first slice vector set of any candidate file and on the session vector, so as to obtain the second slice vector set of the candidate file and the dimension-reduced session vector.
As another example, the plurality of slice vectors in the first slice vector set of any candidate file are clustered to obtain a plurality of vector clusters corresponding to the candidate file; sampling is performed according to the ratio between the number of vectors in each vector cluster and the set sampling number to obtain a sample vector set of the candidate file; and dimension reduction is performed on the plurality of slice vectors in the first slice vector set of the candidate file according to the sample vector set, so as to obtain the second slice vector set of the candidate file.
Dimension reduction of the plurality of slice vectors in the first slice vector set of any candidate file according to the sample vector set of the candidate file, to obtain the second slice vector set, may proceed as follows: determining a sample mean vector of the sample vector set according to the number of sample vectors in the sample vector set; updating each sample vector in the sample vector set according to the sample mean vector to obtain an updated sample vector set; determining a covariance matrix of the updated sample vector set; and performing feature dimension reduction on the plurality of slice vectors in the first slice vector set according to the covariance matrix and the sample mean vector to obtain the second slice vector set.
That is, assume that the total number of clusters obtained by clustering the slice vectors is $K$, the total number of slice vectors in cluster $k$ is $N_{cluster,k}$, and the number of sample-set slices to be extracted is $N_{sample}$. The number of vectors randomly sampled from cluster $k$ is $N_{sample,k}$, which is obtained by rounding up ($\lceil \cdot \rceil$ denotes rounding up). The randomly extracted slices form the sample set $X$, and the number of samples is updated to $N_{sample} = \sum_{k} N_{sample,k}$. Let the $i$-th vector in the sample set $X$ be $x_i$ ($1 \le i \le N_{sample}$); the sample mean vector is then $v_{mean} = \frac{1}{N_{sample}} \sum_{i=1}^{N_{sample}} x_i$. The mean is removed from all vectors in the sample set $X$, namely $Q = (x_1 - v_{mean}, x_2 - v_{mean}, \ldots, x_{N_{sample}} - v_{mean})$; the covariance matrix $QQ^T$ is calculated and decomposed into eigenvalues and eigenvectors; and the eigenvectors corresponding to the largest $M$ eigenvalues are extracted to construct an eigenvector matrix $P$. Feature dimension reduction is then performed on each slice vector according to the sample mean vector $v_{mean}$ and the eigenvector matrix $P$, where $v_{d,c}$ denotes a slice vector in the first slice vector set and the result is the corresponding slice vector in the second slice vector set.
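As a non-limiting illustration of the sampling and mean-removal steps, the following sketch allocates a per-cluster sample quota, draws the sample set, and builds the covariance matrix. The proportional allocation rule (quota of a cluster equals its share of all slice vectors times the requested sample count, rounded up) is an assumption for this example, as are all function names; clustering itself and the eigenvalue decomposition are omitted.

```python
import math
import random
from typing import List

Vector = List[float]


def allocate_samples(cluster_sizes: List[int], n_sample: int) -> List[int]:
    """Per-cluster quota: proportional share rounded up (assumed rule)."""
    total = sum(cluster_sizes)
    return [min(size, math.ceil(n_sample * size / total)) for size in cluster_sizes]


def draw_sample_set(clusters: List[List[Vector]], n_sample: int, seed: int = 0) -> List[Vector]:
    """Randomly draw the quota of vectors from every cluster."""
    rng = random.Random(seed)
    quotas = allocate_samples([len(c) for c in clusters], n_sample)
    sample: List[Vector] = []
    for cluster, quota in zip(clusters, quotas):
        sample.extend(rng.sample(cluster, quota))
    return sample


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of the sample vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def center(vectors: List[Vector], v_mean: Vector) -> List[Vector]:
    """Subtract the sample mean vector from every sample vector."""
    return [[x - m for x, m in zip(v, v_mean)] for v in vectors]


def covariance(centered: List[Vector]) -> List[List[float]]:
    """Unnormalized covariance Q Q^T, i.e. the sum of outer products q q^T."""
    dim = len(centered[0])
    return [
        [sum(q[i] * q[j] for q in centered) for j in range(dim)]
        for i in range(dim)
    ]
```

Because each quota is rounded up, the number of sampled vectors can exceed the requested count, which matches the text's step of updating the sample count to the sum of the per-cluster quotas.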
At the same time, feature dimension reduction is performed on the session vector according to the sample vector set of any candidate file to obtain the dimension-reduced session vector. As one example, a difference vector between the session vector and the sample mean vector of the candidate file is obtained, and the dimension-reduced session vector is obtained from the product of the difference vector and the eigenvector matrix derived from the covariance matrix of the candidate file. For example, with the session vector denoted as $v_{ue}$, the dimension-reduced session vector for the candidate file is computed from $v_{ue} - v_{mean}$ and $P$, where $P$ denotes the eigenvector matrix constructed from the covariance matrix of the candidate file, and $v_{mean}$ denotes the sample mean vector of the candidate file.
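As a non-limiting illustration, the projection applied to both the slice vectors and the session vector can be sketched as below. The sketch assumes the usual principal-component form, in which the mean is subtracted and the result is multiplied by the retained eigenvectors; the function name `project` and the representation of the eigenvector matrix as a list of eigenvectors are choices made for this example, and the eigenvectors are taken as already computed.

```python
from typing import List

Vector = List[float]


def project(vec: Vector, v_mean: Vector, eigenvectors: List[Vector]) -> Vector:
    """Subtract the sample mean, then take the dot product with each
    retained eigenvector (one output component per eigenvector)."""
    diff = [x - m for x, m in zip(vec, v_mean)]
    return [sum(p * d for p, d in zip(ev, diff)) for ev in eigenvectors]


# Hypothetical values: a 2-dimensional space reduced to M = 1 dimension.
v_mean = [2.0, 3.0]
eigenvectors = [[2 ** -0.5, 2 ** -0.5]]  # unit-length leading eigenvector
reduced_session = project([3.0, 4.0], v_mean, eigenvectors)
reduced_slice = project([1.0, 2.0], v_mean, eigenvectors)
```

Applying the same mean vector and eigenvectors to the session vector and to the slice vectors keeps them in one reduced space, so the second similarity between them remains meaningful.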
Step 205, determining a first similarity corresponding to each slice vector in the first slice vector set of the candidate file according to the second similarity between each slice vector in the second slice vector set and the session vector after the dimension reduction processing.
Further, a second similarity between each slice vector in the second slice vector set of any candidate file and the session vector after the dimension reduction processing is calculated, and a first similarity corresponding to each slice vector in the first slice vector set of any candidate file is determined according to the second similarity between each slice vector in the second slice vector set of any candidate file and the session vector after the dimension reduction processing. For example, the second similarity between each slice vector in the second set of slice vectors of any candidate file and the session vector after the dimension reduction processing may be used as the first similarity corresponding to each slice vector in the first set of slice vectors of any candidate file. For another example, a product of the second similarity between each slice vector in the second slice vector set of any candidate file and the session vector after the dimension reduction processing and the corresponding set coefficient may be used as the first similarity corresponding to each slice vector in the first slice vector set of any candidate file.
Step 206, determining at least one fourth slice vector from the plurality of slice vectors in the second slice vector set of the candidate file according to the first similarity corresponding to each slice vector in the first slice vector set of the candidate file.
As one example, at least one fourth slice vector having a first similarity greater than a set similarity threshold may be determined from a plurality of slice vectors in a second set of slice vectors of any candidate file.
Step 207, fusing the fourth slice vectors corresponding to the candidate file to obtain the file vector of the candidate file.
Further, at least one fourth slice vector corresponding to any candidate file is fused to perform vectorization expression on the any candidate file, namely, a file vector of the any candidate file is generated.
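As a non-limiting illustration of steps 206 and 207, the sketch below keeps the slice vectors whose first similarity exceeds a set threshold and fuses them into a file vector. Averaging is an assumed fusion rule (the disclosure only states that the fourth slice vectors are fused), and the function names are hypothetical.

```python
from typing import List, Optional

Vector = List[float]


def select_fourth_slices(
    second_slices: List[Vector], first_sims: List[float], threshold: float
) -> List[Vector]:
    """Keep the slice vectors whose first similarity exceeds the threshold."""
    return [v for v, sim in zip(second_slices, first_sims) if sim > threshold]


def fuse_to_file_vector(slices: List[Vector]) -> Optional[Vector]:
    """Average the selected slice vectors (assumed fusion rule).

    Returns None when no slice passed the threshold.
    """
    if not slices:
        return None
    n = len(slices)
    return [sum(column) / n for column in zip(*slices)]
```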
Step 208, determining target files from the candidate files according to a third similarity between the file vector of each candidate file and the session vector, and sending the target files to the client.
As an example, a popularity value (heat value) of each candidate file and a social group feature value of the group to which the target object belongs are acquired, and the third similarity is weighted according to the popularity value and the social group feature value to obtain a fourth similarity; the candidate files are sorted according to their fourth similarities to obtain a candidate file sequence; and a target file sequence, which includes at least one target file, is determined from the candidate file sequence and sent to the client. For example, the candidate files are sorted in descending order of fourth similarity, a preset number of files with the highest similarity are taken as the target file sequence, and the target file sequence is sent to the client.
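As a non-limiting illustration of this ranking step, the sketch below weights the third similarity by a popularity value and a social group feature value and returns the top files. The multiplicative weighting form and the coefficients `w_pop` and `w_group` are assumptions for this example only.

```python
from typing import List, Tuple


def fourth_similarity(
    third_sim: float,
    popularity: float,
    group_value: float,
    w_pop: float = 0.1,    # hypothetical coefficient
    w_group: float = 0.1,  # hypothetical coefficient
) -> float:
    """Scale the third similarity by popularity and group factors (assumed form)."""
    return third_sim * (1.0 + w_pop * popularity + w_group * group_value)


def rank_files(
    candidates: List[Tuple[str, float, float, float]], top_n: int
) -> List[str]:
    """candidates holds (file_id, third_sim, popularity, group_value) tuples.

    Returns the ids of the top_n files in descending order of fourth similarity.
    """
    scored = [(fid, fourth_similarity(s, p, g)) for fid, s, p, g in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [fid for fid, _ in scored[:top_n]]
```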
It should be noted that some servers (e.g., cloud disks) have certain social properties, such as a circle function or a function for sharing files within a specific group. The importance of files from different groups differs for a user, and files from groups with a closer relationship to the user have higher priority. The popularity value (heat value) of any candidate file may be determined according to the number of times the candidate file has been browsed, collected, shared, downloaded, or commented on, and may be positively correlated with each of these counts. For example, the more times a candidate file has been browsed, collected, shared, or downloaded, the higher its popularity value.
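As a non-limiting illustration, one popularity function that is positively correlated with every count is a weighted sum of logarithmically damped counts. The logarithmic damping and the per-behavior weights below are assumptions; the disclosure only requires the positive correlation.

```python
import math
from typing import Tuple


def popularity_value(
    views: int = 0,
    favorites: int = 0,
    shares: int = 0,
    downloads: int = 0,
    comments: int = 0,
    weights: Tuple[float, ...] = (1.0, 2.0, 3.0, 2.0, 2.0),  # hypothetical weights
) -> float:
    """Weighted sum of log(1 + count); increases with every count."""
    counts = (views, favorites, shares, downloads, comments)
    return sum(w * math.log1p(c) for w, c in zip(weights, counts))
```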
In the embodiment of the disclosure, when the target file sequence is sent to the client, summary information corresponding to each target file in the target file sequence may also be sent to the client. As an example, the plurality of fourth slice vectors of any target file are sorted according to the second similarity corresponding to each fourth slice vector to obtain a slice vector sequence of the target file; a text slice sequence corresponding to the slice vector sequence is acquired from the text slice set of the target file; the text slices in the text slice sequence are spliced to obtain a spliced text; a prompt (prompt words) is constructed according to the spliced text and the search keyword, and the target file is summarized according to the prompt to obtain summary information of the target file; and the target file sequence and the summary information corresponding to each target file are sent to the client, so that each target file and its summary can be displayed.
For example, taking a target file $d$ in the target file sequence, the fourth slice vectors of the target file $d$ are sorted according to the second similarity $\rho_{d,c}$ corresponding to each fourth slice vector to obtain the slice vector sequence $(c_1, c_2, \ldots, c_T)$ of the target file. The text slices corresponding to the slice vector sequence $(c_1, c_2, \ldots, c_T)$ are extracted in order according to the file slice segmentation rule and spliced to obtain the spliced text. A prompt is constructed according to the spliced text and the search keyword; for example, keywords are extracted from the spliced text, and these keywords together with the search keyword are used as the prompt, or each word in the spliced text together with the search keyword is used as the prompt. A semantic model is then used to search for similar content according to the prompt and to extract the summary.
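As a non-limiting illustration of the prompt construction, the sketch below orders text slices by their similarity, splices the best ones, and combines the result with the search keywords. The prompt wording, the `max_slices` cutoff, and the function name are assumptions; the call to the semantic model that produces the summary is not shown.

```python
from typing import List


def build_summary_prompt(
    text_slices: List[str],
    similarities: List[float],
    search_keywords: List[str],
    max_slices: int = 3,  # hypothetical cap on the number of spliced slices
) -> str:
    """Order slices by descending similarity, splice them, and add the keywords."""
    order = sorted(
        range(len(text_slices)), key=lambda i: similarities[i], reverse=True
    )
    spliced = "\n".join(text_slices[i] for i in order[:max_slices])
    keywords = ", ".join(search_keywords)
    return f"Summarize the following text with a focus on: {keywords}\n\n{spliced}"
```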
To make it easier for the user to review operation records, the search recommendation records may be uploaded to a blockchain for evidence storage and traceability, while the original operation records are stored in a centralized database.
In summary, for any candidate file, feature dimension reduction is performed on the plurality of slice vectors in the first slice vector set of the candidate file to obtain a second slice vector set, and feature dimension reduction is performed on the session vector according to the plurality of slice vectors in the first slice vector set to obtain a dimension-reduced session vector; a first similarity corresponding to each slice vector in the first slice vector set of the candidate file is determined according to a second similarity between each slice vector in the second slice vector set and the dimension-reduced session vector; at least one fourth slice vector is determined from the plurality of slice vectors in the second slice vector set according to the first similarity corresponding to each slice vector in the first slice vector set; the fourth slice vectors corresponding to the candidate file are fused to obtain a file vector of the candidate file; and target files are determined from the candidate files according to a third similarity between the file vector of each candidate file and the session vector, and each target file is sent to the client. Because the similarities are computed on the dimension-reduced second slice vector set and the dimension-reduced session vector, the complexity of determining the target files from the plurality of candidate files is reduced.
In order to clearly explain how the preference vector of the target object logged in to the client and the feature vector of the search keyword in the search statement are acquired in the above embodiment, the present disclosure proposes another search method.
Fig. 3 is a flowchart of another search method according to an embodiment of the disclosure.
As shown in fig. 3, the search method may include the steps of:
Step 301, in response to receiving a file search request sent by a client, acquiring a file name and a file title of any candidate file in a plurality of candidate files associated with the client, and processing text information of any candidate file according to the file type of any candidate file to obtain a text slice set.
Step 302, generating a first slice vector set according to the text slice set, the file name, and the file title.
Step 303, obtaining a behavior sequence of the target object of the login client to the plurality of candidate files in a set period.
For example, a usage behavior sequence $(b_1, b_2, \ldots, b_J)$ of the user on the files stored on the server within the set period is extracted, and the operation time sequence and the file index sequence corresponding to the usage behavior sequence are $(t_1, t_2, \ldots, t_J)$ and $(r_1, r_2, \ldots, r_J)$, respectively, where $J$ is the total number of recorded operation behaviors.
Step 304, determining a preference vector of the target object according to the behavior sequence.
As an example, the behavior sequence is weighted according to the social group feature value to which the target object belongs, so as to obtain a preference vector of the target object. For example, it can be expressed as the following formula:
Where $f_1(b_j, t_j)$ represents a time decay weight function of user behavior, and $f_2(b_j)$ represents a user behavior weight function; for example, the interest preference weights expressed by browsing, collecting, sharing, downloading, and commenting on a file are not the same. Because some servers (e.g., cloud disks) have certain social properties, such as a circle function or a function for sharing files within a specific group, the importance of files in different groups differs for a user, and files from more closely related groups generally have higher priority. $f_3(u_{id}, r_j)$ represents the group social feature weight function (social group feature value) of user $u_{id}$ and file $r_j$; if the file exists in multiple groups of the user, the greatest weight is taken. The formula also uses the vector representing the file $r_j$.
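As a non-limiting illustration of how the three weight functions might be combined, and not as the disclosure's own formula, the sketch below forms the preference vector as a normalized weighted sum of file vectors. The exponential half-life decay, the per-behavior weights, the product of the three weights, and the normalization are all assumptions for this example; for simplicity the decay here depends only on the operation time.

```python
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical behavior weights (the role of f_2); values are illustrative.
BEHAVIOR_WEIGHT = {
    "browse": 1.0, "favorite": 2.0, "share": 3.0, "download": 2.0, "comment": 2.0,
}


def time_decay(t_event: float, t_now: float, half_life: float = 7 * 24 * 3600.0) -> float:
    """Assumed form of f_1: the weight halves every half_life seconds."""
    return 0.5 ** ((t_now - t_event) / half_life)


def preference_vector(
    behaviors: List[Tuple[str, float, str]],  # (behavior b_j, time t_j, file id r_j)
    file_vectors: Dict[str, Vector],
    group_weight: Dict[str, float],           # role of f_3 for one user, keyed by file id
    t_now: float,
) -> Vector:
    """Normalized weighted sum of the vectors of the files the user acted on."""
    dim = len(next(iter(file_vectors.values())))
    acc = [0.0] * dim
    total = 0.0
    for behavior, t_event, file_id in behaviors:
        w = (
            time_decay(t_event, t_now)
            * BEHAVIOR_WEIGHT[behavior]
            * group_weight.get(file_id, 1.0)
        )
        total += w
        for i, x in enumerate(file_vectors[file_id]):
            acc[i] += w * x
    return [a / total for a in acc] if total else acc
```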
Step 305, extracting the search keyword from the search statement, and vectorizing the search keyword to obtain the feature vector of the search keyword.
As an example, the search keyword in the search statement is extracted, and a semantic model may be used to vectorize the extracted search keyword to obtain its feature vector. The semantic model may be, for example, a Generative Pre-trained Transformer (GPT) model or a Chat General Language Model (ChatGLM).
Step 306, performing weighted fusion on the preference vector and the feature vector to obtain a session vector.
As an example, the preference vector and the feature vector are weighted and fused to obtain a session vector, which can be expressed by the following formula:
$v_{ue} = w_{be} v_{be} + w_{se} v_{se}$;
where $w_{be}$ represents the weight of the user's behavioral interest preference, $w_{se}$ represents the search keyword weight, $v_{be}$ represents the preference vector, and $v_{se}$ represents the feature vector of the search keyword.
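As a non-limiting illustration, the weighted fusion can be written directly as a component-wise weighted sum of the preference vector and the keyword feature vector. The default weight values below are hypothetical; the disclosure does not state particular values.

```python
from typing import List

Vector = List[float]


def session_vector(
    v_be: Vector,
    v_se: Vector,
    w_be: float = 0.3,  # hypothetical preference weight
    w_se: float = 0.7,  # hypothetical search keyword weight
) -> Vector:
    """Component-wise weighted sum of the preference and keyword vectors."""
    if len(v_be) != len(v_se):
        raise ValueError("preference and keyword vectors must have the same dimension")
    return [w_be * a + w_se * b for a, b in zip(v_be, v_se)]
```

A larger keyword weight keeps the results close to the explicit query, while a larger preference weight pulls them toward the user's past behavior.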
Step 307, determining at least one target file from the candidate files according to a first similarity between the session vector and each slice vector in the first slice vector set of each candidate file, and sending each target file to the client.
In summary, the behavior sequence of the target object logged in to the client on the plurality of candidate files within a set period is obtained; a preference vector of the target object is determined according to the behavior sequence; and the search keyword in the search statement is extracted and vectorized to obtain the feature vector of the search keyword. The preference vector of the target object can thus be effectively determined from its behavior sequence, and the feature vector of the search keyword can be effectively determined by vectorization, so that the preference of the user and the search keyword are considered on the basis of accurately representing the semantics of the candidate files, the files required by the user are accurately recalled from the candidate files, the personalized search requirements of different users are met, and the search experience of the user is improved.
To clearly illustrate how the text information of any candidate file is processed according to the file type of any candidate file in the above embodiment to obtain a text slice set, the present disclosure proposes another search method.
Fig. 4 is a flowchart of another search method according to an embodiment of the disclosure.
As shown in fig. 4, the search method may include the steps of:
Step 401, in response to receiving a file search request sent by a client, acquiring a file name and a file title of any candidate file from a plurality of candidate files associated with the client.
Step 402, extracting text information of any candidate file according to the file type of any candidate file.
As an example, in response to the file type of any candidate file being a picture, extracting a plurality of image features from any candidate file, and converting any image feature of the plurality of image features of any candidate file into corresponding text description information; and determining the text information of any candidate file according to the text description information corresponding to any image feature in the plurality of image features of any candidate file. For example, the text description information corresponding to any one of the plurality of image features of any one candidate file is spliced to obtain the text information of any one candidate file.
As another example, in response to the file type of any candidate file being audio, audio data in the candidate file is extracted; the audio data is converted into corresponding text description information; and the text information of the candidate file is determined according to the text description information corresponding to the audio data. For example, the text description information corresponding to the audio data in the candidate file is used as the text information of the candidate file.
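As a non-limiting illustration of type-dependent text extraction, the sketch below dispatches on the file type. The image captioning and speech transcription models are passed in as callables named `describe` and `transcribe`; both are hypothetical stand-ins, since the disclosure does not name specific models.

```python
from typing import Callable, List


def picture_to_text(image_features: List[str], describe: Callable[[str], str]) -> str:
    """Convert each image feature to a text description and splice the results."""
    return " ".join(describe(feature) for feature in image_features)


def audio_to_text(audio_data: bytes, transcribe: Callable[[bytes], str]) -> str:
    """Use the text description of the audio data as the file's text information."""
    return transcribe(audio_data)


def extract_text(file_type: str, content, describe, transcribe) -> str:
    """Dispatch on the file type; text files are returned unchanged."""
    if file_type == "text":
        return content
    if file_type == "picture":
        return picture_to_text(content, describe)
    if file_type == "audio":
        return audio_to_text(content, transcribe)
    raise ValueError(f"unsupported file type: {file_type}")
```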
It should be noted that, in order to support personalized permission configuration, the file search recommendation may be restricted to specified files, for example, files belonging to a certain group, files in a specific format, or files within the search permission; specific files may also be excluded from the search.
As one example, from among a plurality of file albums stored, a target file album matching the search statement is determined from the search statement in the file search request; files in the target file album are taken as a plurality of candidate files associated with the client.
As another example, extracting a file format of a first file to be searched in a search sentence, and obtaining a plurality of first files to be searched matched with the file format from a plurality of stored files; and taking the plurality of first files to be searched as a plurality of candidate files associated with the client.
As yet another example, a search right matching the target object is obtained, and a plurality of second files to be searched matching the search right are obtained from the stored plurality of files; and taking the plurality of second files to be searched as a plurality of candidate files associated with the client.
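As a non-limiting illustration of narrowing the candidate files, the sketch below filters stored files by format, group, search permission, and an exclusion list. The `StoredFile` record, its fields, and the integer permission level are hypothetical constructs for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class StoredFile:
    name: str
    fmt: str                  # e.g. "pdf" or "png"
    group: str                # group or circle the file belongs to
    required_permission: int  # minimum permission level needed to search it


def candidate_files(
    files: List[StoredFile],
    fmt: Optional[str] = None,
    group: Optional[str] = None,
    user_permission: int = 0,
    excluded: FrozenSet[str] = frozenset(),
) -> List[StoredFile]:
    """Keep files that match the requested format and group, are within the
    user's search permission, and are not explicitly excluded."""
    return [
        f for f in files
        if (fmt is None or f.fmt == fmt)
        and (group is None or f.group == group)
        and f.required_permission <= user_permission
        and f.name not in excluded
    ]
```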
Step 403, performing text segmentation on the text information of any candidate file according to the length information of the text information and the set slice length information, so as to obtain a text slice set of the candidate file.
Wherein the text slice set comprises a plurality of text slices.
To keep the semantic information between adjacent slices smooth, adjacent slices may share repeated text, that is, the tail of the previous slice is repeated at the head of the next slice. As an example, let the length of the text information of any candidate file be $L_d$ and the slice length be $L_{max}$. When $L_d \le L_{max}$, no segmentation is performed, and the text slice set of the candidate file contains only 1 text slice. Otherwise, segmentation is performed by sliding smoothly with overlapping adjacent slices, and the number of text slices in the text slice set may be $N_d = \lceil (L_d - L_{cp}) / (L_{max} - L_{cp}) \rceil$, where $L_{cp}$ is the repeated text length between adjacent slices and $\lceil \cdot \rceil$ denotes rounding up.
As another example, the text information of any candidate file may be text-segmented according to paragraphs to obtain a text slice set of any candidate file, for example, the text information of any candidate file includes 5 text paragraphs, each paragraph is a text slice, and the text slice set of any candidate file may include 5 text slices.
Step 404, generating a first set of slice vectors according to the set of text slices, the file name and the file title.
Step 405, obtain the preference vector of the target object of the login client and the feature vector of the search keyword in the search statement in the search request, and fuse the preference vector with the feature vector to obtain the session vector.
Step 406, determining at least one target file from the candidate files according to the session vector and the first similarity between the slice vectors in the first slice vector set of the candidate files, and sending the target files to the client.
In summary, text information of any candidate file is extracted according to the file type of the candidate file, and the text information is segmented according to its length information and the set slice length information to obtain a text slice set that includes a plurality of text slices. In this way, each candidate file is given a slice-level vectorized representation based on its file type and in consideration of its file name and file title, and at least one target file is determined by combining the session vector obtained from the preference vector of the target object and the feature vector of the search keyword in the search statement. The preferences and search keywords of the user are thus considered on the basis of accurately characterizing the semantics of the candidate files, so that the files required by the user are accurately recalled from the candidate files, the personalized search requirements of different users are met, and the search experience of the user is improved.
On the basis of any embodiment of the present disclosure, taking a server as a cloud disk as an example, an implementation flow of the present disclosure may be as shown in fig. 5, and mainly includes the following steps:
1. File slice corpus (text slice set) generation, which may be as shown in fig. 6 and may specifically include the following steps:
(1) Text information extraction: taking file $d$ ($1 \le d \le D$) as an example, first determine the file type. If the file is a text file, extract the text information directly; if it is a picture file, obtain text information such as scene tags by means of an image-to-text semantic analysis model; if it is an audio file, obtain the text information by means of an audio-to-text semantic analysis model; otherwise, treat the file as an invalid file. It should be noted that mature large language model technologies and tools such as API interfaces already exist for image-to-text and audio-to-text conversion, for example GPT-4, DALL-E, and Whisper, which will not be described here.
(2) Text information preprocessing: clean the text information obtained in step (1), including preprocessing operations such as removing stop words;
(3) Text slice generation: to keep the semantic information between adjacent slices smooth, text is repeated between adjacent slices, that is, the tail of the previous slice is repeated at the head of the next slice. Assume that the length of the text information obtained in step (2) is $L_d$, the maximum slice length is $L_{max}$, and the repeated text length between adjacent slices is $L_{cp}$. Segmentation follows this principle: when the text length is not greater than the maximum slice length, namely $L_d \le L_{max}$, no segmentation is performed, and the file has only $N_d = 1$ slice of length $L_d$; otherwise, the file is segmented by sliding smoothly with overlapping adjacent slices, and the number of slices is $N_d = \lceil (L_d - L_{cp}) / (L_{max} - L_{cp}) \rceil$, where $\lceil \cdot \rceil$ denotes rounding up.
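As a non-limiting illustration, the overlapping segmentation of step (3) may be sketched as follows. The sketch assumes that lengths are counted in characters, that every slice except possibly the last has length $L_{max}$, and that the slice count is $\lceil (L_d - L_{cp}) / (L_{max} - L_{cp}) \rceil$; the function name `slice_text` is hypothetical.

```python
import math


def slice_text(text: str, l_max: int, l_cp: int) -> list[str]:
    """Split text into slices of at most l_max characters that overlap by l_cp."""
    if not 0 <= l_cp < l_max:
        raise ValueError("require 0 <= l_cp < l_max")
    l_d = len(text)
    if l_d <= l_max:
        return [text]  # short text: a single slice, no segmentation
    stride = l_max - l_cp  # how far the window moves between slices
    n_d = math.ceil((l_d - l_cp) / stride)  # assumed slice-count formula
    return [text[i * stride: i * stride + l_max] for i in range(n_d)]
```

For example, `slice_text("abcdefghij", 4, 1)` yields the three slices `"abcd"`, `"defg"` and `"ghij"`, where the last character of each slice is repeated as the first character of the next.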
2. File slice vectorization
(1) Name and title vectorization: extract the name and the title of file $d$ ($1 \le d \le D$), and vectorize them through a semantic analysis model to obtain the corresponding vectors $v_{file,d}$ and $v_{title,d}$;
(2) Slice content vectorization: for the slice corpus (text slice set) $s_{d,c}$ ($1 \le c \le N_d$) obtained by segmenting the file, input each slice in turn into the semantic analysis model for vectorization to obtain the corresponding vector $v_{chip,d,c}$;
(3) Slice vector weighted fusion: let the weights corresponding to the file name, title and slice be $w_{file,d}$, $w_{title,d}$ and $w_{chip,d}$ respectively; then the vector after weighted fusion of the slice vectors is $v_{d,c} = w_{file,d} v_{file,d} + w_{title,d} v_{title,d} + w_{chip,d} v_{chip,d,c}$;
(4) Slice vector storage: store the slice vectors obtained in step (3) and their corresponding corpus content in a vector database to facilitate data persistence and subsequent vector retrieval.
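As a non-limiting illustration, the weighted fusion of step (3) may be sketched as follows, with vectors represented as plain lists of floats. The function name `fuse_slice_vectors` and the default weight values are hypothetical; the disclosure does not fix the weights.

```python
def fuse_slice_vectors(v_file, v_title, v_chips,
                       w_file=0.25, w_title=0.25, w_chip=0.5):
    """Fuse the file name vector and title vector into every slice vector.

    Returns one fused vector per slice:
    w_file * v_file + w_title * v_title + w_chip * v_chip.
    """
    return [
        [w_file * f + w_title * t + w_chip * c
         for f, t, c in zip(v_file, v_title, v_chip)]
        for v_chip in v_chips
    ]
```

Because the name and title vectors are shared by all slices of a file, every fused slice vector carries file-level context in addition to its own content.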
3. File slice feature dimension reduction
(1) Clustering slice vectors: cluster the file slice vectors $v_{d,c}$ ($1 \le d \le D$, $1 \le c \le N_d$) according to a preset rule, for example, using the K-means algorithm;
(2) Random sampling to form a sample set: assume that the total number of clusters after clustering is $K$, the total number of slice vectors in cluster $k$ is $N_{cluster,k}$, and the number of sample-set slices to be extracted is $N_{sample}$. The number of vectors $N_{sample,k}$ randomly sampled from cluster $k$ is
$$N_{sample,k} = \left\lceil N_{sample} \cdot \frac{N_{cluster,k}}{\sum_{k'=1}^{K} N_{cluster,k'}} \right\rceil$$
where $\lceil \cdot \rceil$ denotes rounding up. The randomly extracted slices form the sample vector set $X$, and the number of samples is updated to $N_{sample} = \sum_{k} N_{sample,k}$;
(3) Let the $i$-th vector in the sample vector set $X$ be $x_i$ ($1 \le i \le N_{sample}$); then the sample mean vector is $v_{mean} = \frac{1}{N_{sample}} \sum_{i=1}^{N_{sample}} x_i$;
(4) Center all vectors in the sample vector set $X$ to form $Q = (x_1 - v_{mean}, x_2 - v_{mean}, \ldots, x_{N_{sample}} - v_{mean})$, calculate the covariance matrix $QQ^{T}$, perform eigenvalue decomposition on $QQ^{T}$, and extract the eigenvectors corresponding to the $M$ largest eigenvalues to construct the eigenvector matrix $P$;
(5) According to the sample mean vector and the eigenvector matrix, with the extracted eigenvectors taken as the columns of $P$, the feature dimension reduction of a slice vector can be expressed as $\tilde{v}_{d,c} = P^{T}(v_{d,c} - v_{mean})$.
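As a non-limiting illustration, the sampling and dimension reduction of steps (2) to (5) may be sketched with NumPy as follows. The sketch assumes that cluster labels are already available (for example from K-means, which is not shown), that the per-cluster sample count is proportional to cluster size and rounded up, and that the reduced vector is obtained by projecting the centered vector onto the top $M$ eigenvectors. All function names are hypothetical.

```python
import math

import numpy as np


def stratified_sample(vectors, labels, n_sample, rng):
    """Randomly draw ceil(n_sample * N_k / N) vectors from each cluster k."""
    total = len(vectors)
    picked = []
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        n_k = min(len(idx), math.ceil(n_sample * len(idx) / total))
        picked.extend(rng.choice(idx, size=n_k, replace=False))
    return vectors[np.array(picked)]


def fit_projection(samples, m):
    """Return (v_mean, P); the columns of P are the top-m eigenvectors."""
    v_mean = samples.mean(axis=0)
    q = (samples - v_mean).T  # columns are centered samples
    eigvals, eigvecs = np.linalg.eigh(q @ q.T)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]  # indices of the m largest
    return v_mean, eigvecs[:, order]


def reduce_dim(v, v_mean, p):
    """Project one vector or a batch of row vectors: P^T (v - v_mean)."""
    return (v - v_mean) @ p
```

The same `v_mean` and `P` would be applied to the user search session vector, so that slice vectors and the session vector are compared in the same reduced space.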
4. User search session vectorized expression
(1) Extracting user cloud disk historical behaviors: extract the user's cloud disk file usage behavior sequence $(b_1, b_2, \ldots, b_J)$ within a set time period, whose corresponding operation time sequence and file index sequence are $(t_1, t_2, \ldots, t_J)$ and $(r_1, r_2, \ldots, r_J)$ respectively, where $J$ is the total number of operation behavior records;
(2) User interest preference vector expression: according to the user behaviors from step (1), the user interest preference vector $v_{be}$ is computed by weighting the vectors of the files in the behavior sequence with the weight functions described below.
Here, $f_1(b_j, t_j)$ represents the user behavior time-decay weight function and $f_2(b_j)$ represents the user behavior weight function; for example, file browsing, collection, sharing, downloading, and commenting express different interest preference weights. Because some servers (e.g., cloud disks) have certain social properties, such as a circle function or a file sharing function for specific groups, files in different groups differ in importance to the user, and files of more closely related groups generally have higher priority. $f_3(u_{id}, r_j)$ represents the group social characteristic weight function (social group feature value) of user $u_{id}$ and file $r_j$; if the file exists in multiple groups of the user, the greatest weight is taken. $v_{r_j}$ denotes the vector representing file $r_j$.
(3) Search keyword vectorization expression: vectorize the search keywords of the user by means of the same semantic analysis model as used in the file slice vectorization to obtain the search keyword vector (feature vector) $v_{se}$;
(4) User search session vectorization expression: the user search session vector is the weighted sum of the user historical-behavior interest preference vector and the search keyword vector (feature vector), i.e.
$v_{ue} = w_{be} v_{be} + w_{se} v_{se}$;
Where $w_{be}$ represents the user behavioral interest preference weight and $w_{se}$ represents the search keyword weight.
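As a non-limiting illustration, the preference vector and the session vector may be sketched as follows. Several choices here are assumptions made only for this sketch and are not stated in the disclosure: the time-decay function $f_1$ is taken to be exponential with a hypothetical half-life, the behavior weights in `BEHAVIOR_WEIGHT` are hypothetical values for $f_2$, the three weights are multiplied together, and the weighted sum of file vectors is normalized by the total weight.

```python
# Hypothetical behavior weights standing in for f2.
BEHAVIOR_WEIGHT = {"browse": 1.0, "collect": 2.0, "share": 2.0,
                   "download": 3.0, "comment": 2.0}


def preference_vector(records, file_vectors, now, half_life=7 * 86400.0):
    """records: iterable of (behavior, timestamp, file_id, group_weight)."""
    dim = len(next(iter(file_vectors.values())))
    acc = [0.0] * dim
    total = 0.0
    for behavior, t, file_id, group_w in records:
        decay = 0.5 ** ((now - t) / half_life)  # f1: assumed exponential decay
        w = decay * BEHAVIOR_WEIGHT[behavior] * group_w  # f1 * f2 * f3 (assumed)
        total += w
        acc = [a + w * x for a, x in zip(acc, file_vectors[file_id])]
    return [a / total for a in acc] if total else acc


def session_vector(v_be, v_se, w_be=0.5, w_se=0.5):
    """v_ue = w_be * v_be + w_se * v_se."""
    return [w_be * b + w_se * s for b, s in zip(v_be, v_se)]
```

Raising `w_se` relative to `w_be` makes the result follow the typed keywords more closely, while raising `w_be` biases it toward the user's historical interests.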
5. File slice similarity retrieval
(1) File slice similarity calculation: based on the feature-dimension-reduced slice vector and the dimension-reduced user search session vector, calculate the similarity $\rho_{d,c}$ between them;
(2) File slice screening: preset a similarity threshold $\rho_{th}$, and select the file slices satisfying $\rho_{d,c} \ge \rho_{th}$.
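As a non-limiting illustration, the slice similarity calculation and screening may be sketched as follows. The use of cosine similarity and the "greater than or equal to" comparison against the threshold are assumptions of this sketch; the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def screen_slices(slice_vectors, session_vec, rho_th):
    """slice_vectors maps (d, c) to a reduced slice vector.

    Returns {(d, c): rho} for every slice whose similarity reaches rho_th.
    """
    kept = {}
    for key, vec in slice_vectors.items():
        rho = cosine(vec, session_vec)
        if rho >= rho_th:
            kept[key] = rho
    return kept
```

Keeping the similarity value alongside each retained slice lets the later file-level fusion and summary steps reuse it without recomputation.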
6. File similarity retrieval
(1) File vector expression: according to the preliminary retrieval result of the file slices, the retained slice vectors of each file are weighted and fused to form a vectorized representation of the file;
(2) File similarity calculation: the similarity between each file vector from step (1) and the user search session vector is calculated in turn.
7. File search recommendation list ranking
(1) File weighted similarity calculation: the file similarity is weighted in consideration of file popularity and social group characteristics to obtain a weighted similarity;
(2) Returning a file search list: sort the file retrieval results by weighted similarity from largest to smallest, and return a preset number of the files with the highest similarity as the recommended file list of the current search session.
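As a non-limiting illustration, the weighting and ranking of the recommendation list may be sketched as follows. Multiplying the file similarity by a popularity value and a social group feature value is an assumption of this sketch, since the disclosure only states that both factors are considered; the function name `rank_files` and the default factor of 1.0 for missing entries are hypothetical.

```python
def rank_files(file_sims, popularity, group_weight, top_n):
    """Weight each file similarity and return the top_n (file_id, score) pairs.

    file_sims, popularity and group_weight map file ids to floats; a file
    missing from popularity or group_weight gets a neutral factor of 1.0.
    """
    scored = {
        d: rho * popularity.get(d, 1.0) * group_weight.get(d, 1.0)
        for d, rho in file_sims.items()
    }
    # Sort by weighted similarity from largest to smallest.
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

The returned pairs keep the weighted similarity so the caller can display or log the score next to each recommended file.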
As a possible implementation manner of the embodiment of the present disclosure, when a personalized recommended file list is returned for a cloud disk search, a prompt may be constructed from the corpus of high-similarity slice content, a summary may be generated by means of a latent semantic model (such as GPT or ChatGLM), and the summary content may be presented so that the user can check and confirm the search result. The method is as follows:
(1) Slice index extraction: taking file $d$ in the personalized recommendation list as an example, extract the slice retrieval result of the file obtained in the file slice similarity retrieval, and sort it by the similarity $\rho_{d,c}$ to obtain the slice index (slice vector sequence) $(c_1, c_2, \ldots, c_T)$;
(2) Corpus extraction: extract in turn the text sequences corresponding to the slice indexes from step (1) according to the file slice segmentation rule, and separate the slices with separators;
(3) Semantic model summarization: construct a prompt from the corpus and the search keywords, and use the latent semantic model to find the content relevant to the search and extract a summary.
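As a non-limiting illustration, steps (1) to (3) above may be sketched up to the point where the prompt is handed to a model. The prompt wording, the separator string, and the function name `build_prompt` are hypothetical, slices are assumed to be sorted from highest to lowest similarity, and the call to the model itself is omitted.

```python
def build_prompt(slices, similarities, keywords, separator="\n---\n"):
    """Build a summarization prompt from the retained slices of one file.

    slices maps slice index c to its text; similarities maps the retained
    slice indexes to rho_{d,c}. Slices are ordered by descending similarity.
    """
    order = sorted(similarities, key=similarities.get, reverse=True)
    corpus = separator.join(slices[c] for c in order)
    return (
        "Search keywords: " + ", ".join(keywords) + "\n"
        "Summarize the passages below that are relevant to the keywords.\n"
        + corpus
    )
```

Only slices that passed the similarity screening appear in `similarities`, so unrelated parts of the file never reach the prompt.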
As a possible implementation manner of the embodiments of the present disclosure, when performing a personalized file search recommendation on a cloud disk, specific files may be designated for the search, for example, files belonging to a certain group or files in a specific format, and specific files may likewise be excluded from the search, so as to satisfy the personalized permission configuration.
As a possible implementation manner of the embodiments of the present disclosure, in the method for generating a file slice corpus, segmentation may be performed according to the paragraph structure, in which case there is no need to keep the lengths of the slice text sequences as equal as possible.
To facilitate business operations such as user review of operation records, auditing, and log viewing, as a possible implementation manner of the embodiment of the disclosure, as shown in fig. 7, the cloud disk may upload the user's personalized file search recommendation records to a blockchain for provenance tracing, while the original operation records are stored in a centralized database.
In order to implement the embodiments shown in fig. 1 to 7 described above, the present disclosure proposes a search apparatus.
Fig. 8 is a schematic structural diagram of a search device according to an embodiment of the disclosure.
As shown in fig. 8, the search apparatus 800 includes: a first processing module 810, a generating module 820, a fusing module 830, and a determining module 840.
The first processing module 810 is configured to, in response to receiving a file search request sent by a client, obtain, for any candidate file of a plurality of candidate files associated with the client, a file name and a file title of the any candidate file, and process text information of the any candidate file according to a file type of the any candidate file, so as to obtain a text slice set; a generating module 820, configured to generate a first set of slice vectors according to the set of text slices, the file name and the file title; the fusion module 830 is configured to obtain a preference vector of a target object of the login client and a feature vector of a search keyword in a search statement in the search request, and fuse the preference vector with the feature vector to obtain a session vector; the determining module 840 is configured to determine at least one target file from the candidate files according to the session vector and the first similarity between the slice vectors in the first set of slice vectors of the candidate files, and send the target files to the client.
As a possible implementation manner of the embodiments of the present disclosure, the determining module 840 is configured to, for any candidate file, perform feature dimension reduction on the plurality of slice vectors in the first slice vector set of the candidate file to obtain a second slice vector set, and perform feature dimension reduction on the session vector according to the plurality of slice vectors in the first slice vector set of the candidate file to obtain a session vector after dimension reduction processing; determine a first similarity corresponding to each slice vector in the first slice vector set of the candidate file according to the second similarity between each slice vector in the second slice vector set and the session vector after the dimension reduction processing; determine at least one fourth slice vector from the plurality of slice vectors in the second slice vector set of the candidate file according to the first similarity corresponding to each slice vector in the first slice vector set of the candidate file; fuse the fourth slice vectors corresponding to the candidate file to obtain a file vector of the candidate file; and determine target files from the candidate files according to a third similarity between the file vector of each candidate file and the session vector, and send the target files to the client.
As one possible implementation manner of the embodiments of the present disclosure, the determining module 840 is further configured to, for any candidate file, cluster the plurality of slice vectors in the first slice vector set of the candidate file to obtain a plurality of vector clusters; extract, according to the ratio of the number of slice vectors in each vector cluster to the set sampling number, the set number of sample vectors from each vector cluster to generate a sample vector set; and perform feature dimension reduction on the plurality of slice vectors in the first slice vector set of the candidate file according to the sample vector set to obtain the second slice vector set.
As one possible implementation manner of the embodiments of the present disclosure, the determining module 840 is further configured to determine a sample mean vector of the sample vector set according to the number of sample vectors in the sample vector set; update each sample vector in the sample vector set according to the sample mean vector to obtain an updated sample vector set; determine a covariance matrix of the updated sample vector set; and perform feature dimension reduction on the plurality of slice vectors in the first slice vector set of the candidate file according to the covariance matrix and the sample mean vector to obtain the second slice vector set.
As one possible implementation manner of the embodiments of the present disclosure, the determining module 840 is further configured to obtain a difference vector between the session vector and the sample mean vector of the candidate file; and obtain the session vector after the dimension reduction processing according to the product of the difference vector of the candidate file and the covariance matrix.
As a possible implementation manner of the embodiment of the present disclosure, the determining module 840 is further configured to obtain a popularity value of the candidate file and a social group feature value to which the target object belongs, and weight the third similarity according to the popularity value of the candidate file and the social group feature value to which the target object belongs, so as to obtain a fourth similarity; sorting the candidate files according to the fourth similarity of the candidate files to obtain a candidate file sequence; and determining a target file sequence from the candidate file sequences, and sending the target file sequence to the client, wherein the target file sequence comprises at least one target file.
As a possible implementation manner of the embodiments of the present disclosure, the determining module 840 is further configured to, for any target file in the target file sequence, sort the fourth slice vectors of the target file according to the second similarity corresponding to each fourth slice vector, so as to obtain a slice vector sequence; acquire a text slice sequence corresponding to the slice vector sequence from the text slice set; splice the text slices in the text slice sequence to obtain a spliced text; construct a prompt according to the spliced text and the search keyword, and summarize the target file according to the prompt to obtain summary information of the target file; and send the target file sequence and the summary information corresponding to each target file in the target file sequence to the client, so as to display each target file in the target file sequence and its corresponding summary.
As one possible implementation of the embodiment of the present disclosure, the search apparatus 800 further includes a sending module.
The sending module is used for sending the target file sequence and the summary information corresponding to each file in the target file sequence to the blockchain so as to store the target file sequence and the summary information corresponding to each file in the target file sequence.
As one possible implementation manner of the embodiments of the present disclosure, the generating module 820 is configured to vectorize each text slice in the text slice set, the file name, and the file title to obtain an initial slice vector set, a file name vector, and a file title vector; and fuse the initial slice vector set, the file name vector and the file title vector to obtain the first slice vector set.
As a possible implementation manner of the embodiments of the present disclosure, a fusion module 830 is configured to obtain a behavior sequence of a target object of a login client on a candidate file within a set period; determining a preference vector of the target object according to the behavior sequence; extracting search keywords in the search sentences, and carrying out vectorization representation on the search keywords to obtain feature vectors of the search keywords.
As a possible implementation manner of the embodiment of the present disclosure, the first processing module 810 is configured to extract text information of any candidate file according to a file type of the any candidate file; according to the length information of the text information of any candidate file and the set slice length information, text segmentation is carried out on the text information of any candidate file so as to obtain a text slice set of any candidate file; wherein the text slice set comprises a plurality of text slices.
As a possible implementation manner of the embodiment of the present disclosure, the first processing module 810 is further configured to extract a plurality of image features from any candidate file in response to the file type of any candidate file being a picture, and convert any image feature in the plurality of image features of any candidate file into corresponding text description information; and determining the text information of any candidate file according to the text description information corresponding to each image characteristic of any candidate file.
As one possible implementation of the embodiment of the present disclosure, the search apparatus 800 further includes: an extraction module and a conversion module.
The extraction module is used for responding to the file type of any candidate file as audio, and extracting audio data in any candidate file; the conversion module is used for converting the audio data in any candidate file into corresponding text description information; the determining module 840 is configured to determine text information of any candidate file according to text description information corresponding to the audio data in any candidate file.
As one possible implementation of the embodiment of the present disclosure, the search apparatus 800 further includes a second processing module.
The second processing module is used for determining a target file album matched with the search statement from a plurality of stored file albums according to the search statement in the file search request; and taking the files in the target file album as a plurality of candidate files associated with the client.
As one possible implementation of the embodiment of the present disclosure, the search apparatus 800 further includes a third processing module.
The third processing module is used for extracting the file format of a first file to be searched in the search statement, and acquiring a plurality of first files to be searched matched with the file format from a plurality of stored files; and taking the plurality of first files to be searched as a plurality of candidate files associated with the client.
As one possible implementation of the embodiment of the present disclosure, the search apparatus 800 further includes a fourth processing module.
The fourth processing module is used for acquiring the searching authority matched with the target object and acquiring a plurality of second files to be searched matched with the searching authority from the stored files; and taking the plurality of second files to be searched as a plurality of candidate files associated with the client.
In response to receiving a file search request sent by a client, the search device of the embodiment of the disclosure obtains, for any candidate file in a plurality of candidate files associated with the client, a file name and a file title of the candidate file, and processes text information of the candidate file according to the file type of the candidate file to obtain a text slice set; generates a first slice vector set according to the text slice set, the file name and the file title; obtains a preference vector of a target object logged in to the client and a feature vector of a search keyword in a search statement in the search request, and fuses the preference vector and the feature vector to obtain a session vector; and determines at least one target file from the candidate files according to the first similarity between the session vector and the slice vectors in the first slice vector set of each candidate file, and sends each target file to the client. In this way, each candidate file is given a slice-level vectorized representation based on its file type and in consideration of its file name and file title, and at least one target file is determined from the plurality of candidate files by combining the session vector obtained from the preference vector of the target object and the feature vector of the search keyword in the search statement. The preferences and search keywords of the user are thus considered on the basis of accurately characterizing the semantics of the candidate files, so that the files required by the user are accurately recalled from the candidate files, the personalized search requirements of different users are met, and the search experience of the user is improved.
It should be noted that the foregoing explanation of the searching method embodiment is also applicable to the searching apparatus of this embodiment, and will not be repeated here.
In an exemplary embodiment, an electronic device is also presented.
The electronic device includes:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to execute instructions to implement a search method as set forth in any of the foregoing embodiments.
As an example, fig. 9 is a schematic structural diagram of an electronic device 900 according to an exemplary embodiment of the disclosure, where, as shown in fig. 9, the electronic device 900 may further include:
Memory 910 and processor 920, bus 930 connecting the different components (including memory 910 and processor 920), memory 910 storing a computer program that when executed by processor 920 implements the search method described in the embodiments of the present disclosure.
Bus 930 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 900 typically includes a variety of electronic device readable media. Such media can be any available media that is accessible by electronic device 900 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 910 may also include computer-system readable media in the form of volatile memory such as Random Access Memory (RAM) 940 and/or cache memory 950. The electronic device 900 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 960 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in fig. 9, commonly referred to as a "hard disk drive"). Although not shown in fig. 9, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 930 via one or more data medium interfaces. Memory 910 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the various embodiments of the disclosure.
A program/utility 980 having a set (at least one) of program modules 970 may be stored, for example, in memory 910, such program modules 970 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 970 generally perform the functions and/or methods in the embodiments described in this disclosure.
The electronic device 900 may also communicate with one or more external devices 990 (e.g., keyboard, pointing device, display 991, etc.), one or more devices that enable a user to interact with the electronic device 900, and/or any devices (e.g., network card, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 992. Also, the electronic device 900 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter 993. As shown, the network adapter 993 communicates with other modules of the electronic device 900 over the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 900, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 920 performs various functional applications and data processing by running programs stored in the memory 910.
It should be noted that, the implementation process and the technical principle of the electronic device in this embodiment refer to the foregoing explanation of the search method in the embodiment of the disclosure, and are not repeated herein.
In an exemplary embodiment, a computer readable storage medium is also provided, e.g. a memory, comprising instructions executable by a processor of an electronic device to perform the search method set forth in any of the embodiments described above. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program/instruction, characterized in that the computer program/instruction, when executed by a processor, implements the search method proposed by any of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.