CN117331893B - Search method, device, electronic device and storage medium - Google Patents

Info

Publication number
CN117331893B
Authority
CN
China
Prior art keywords
file
vector
slice
candidate
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311224005.8A
Other languages
Chinese (zh)
Other versions
CN117331893A (en
Inventor
郑正广
关矛
张�杰
余东辉
林立言
张云
闫宇
钟声振
郑永欣
沈子璐
彭小成
方培先
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Internet Co Ltd
Priority to CN202311224005.8A
Publication of CN117331893A
Application granted
Publication of CN117331893B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure proposes a search method, device, electronic device and storage medium. The method includes: in response to receiving a file search request sent by a client, processing text information of any candidate file according to the file type of any candidate file among multiple candidate files associated with the client to obtain a text slice set; generating a first slice vector set according to the text slice set, the file name and the file title; fusing the preference vector of the target object logged in to the client and the feature vector of the search keyword in the search statement in the search request to obtain a session vector; determining at least one target file from each candidate file according to the first similarity between the session vector and each slice vector in the first slice vector set of each candidate file, and sending each target file to the client, thereby accurately recalling the files required by the user from each candidate file, and meeting the personalized search needs of different users.

Description

Searching method, searching device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of data processing, and in particular relates to a searching method, a searching device, electronic equipment and a storage medium.
Background
Currently, some servers (e.g., cloud disks) allow users to upload their own files for storage, backup, synchronization, and sharing, so that the files can be accessed and managed anytime and anywhere. For users, searching for files is an essential operation in daily work: an effective search helps users find the required files quickly and accurately, which improves work efficiency and the user experience.
In the related art, the files required by a user are recalled based on the similarity between the search statement input by the user and the files stored on the server. However, recalling files solely according to the similarity between the search statement and the files may fail to meet the personalized search requirements of different users, which degrades the user's search experience.
Disclosure of Invention
The present disclosure aims to solve, at least to some extent, one of the technical problems in the related art.
Therefore, a first object of the present disclosure is to propose a search method that represents candidate files as slice-level vectors based on the file type of each candidate file, taking the file name and file title into account, and that determines at least one target file from a plurality of candidate files by using a session vector obtained by fusing a preference vector of a target object with the feature vector of the search keyword in a search statement. In this way, user preferences and search keywords are both considered on the basis of an accurate representation of the semantics of the candidate files, so that the files required by the user are accurately recalled from the candidate files, the personalized search requirements of different users are met, and the user's search experience is improved.
A second object of the present disclosure is to propose a search device.
A third object of the present disclosure is to propose an electronic device.
A fourth object of the present disclosure is to propose a non-transitory computer readable storage medium storing computer instructions.
A fifth object of the present disclosure is to propose a computer programme product.
To achieve the above object, an embodiment of a first aspect of the present disclosure provides a search method, including: responding to a file search request sent by a client, acquiring a file name and a file title of any candidate file in a plurality of candidate files associated with the client, and processing text information of the any candidate file according to the file type of the any candidate file to obtain a text slice set; generating a first slice vector set according to the text slice set, the file name and the file title; obtaining a preference vector of a target object logged in the client and a feature vector of a search keyword in a search statement in the search request, and fusing the preference vector and the feature vector to obtain a session vector; and determining at least one target file from the candidate files according to the session vector and the first similarity between slice vectors in the first slice vector set of the candidate files, and sending the target files to the client.
To achieve the above object, an embodiment of a second aspect of the present disclosure provides a search apparatus, including: a first processing module, configured to, in response to receiving a file search request sent by a client, acquire, for any candidate file in a plurality of candidate files associated with the client, a file name and a file title of the candidate file, and process text information of the candidate file according to the file type of the candidate file to obtain a text slice set; a generation module, configured to generate a first slice vector set according to the text slice set, the file name and the file title; a fusion module, configured to acquire a preference vector of a target object logged in to the client and a feature vector of a search keyword in a search statement in the search request, and fuse the preference vector and the feature vector to obtain a session vector; and a determining module, configured to determine at least one target file from the candidate files according to the first similarity between the session vector and each slice vector in the first slice vector set of each candidate file, and send the target files to the client.
To achieve the above object, an embodiment of a third aspect of the present disclosure provides an electronic device, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the search method as described in the embodiment of the first aspect.
To achieve the above object, a fourth aspect embodiment of the present disclosure proposes a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the search method as described in the first aspect embodiment.
To achieve the above object, an embodiment of a fifth aspect of the present disclosure proposes a computer program product, wherein, when instructions in the computer program product are executed by a processor, the search method according to the embodiment of the first aspect is performed.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flow chart of a search method according to an embodiment of the disclosure;
FIG. 2 is a flowchart of another searching method according to an embodiment of the disclosure;
FIG. 3 is a flowchart illustrating another searching method according to an embodiment of the disclosure;
FIG. 4 is a flowchart of another searching method according to an embodiment of the disclosure;
FIG. 5 is a flowchart of another searching method according to an embodiment of the disclosure;
FIG. 6 is a schematic flow chart of generating a document slice corpus according to an embodiment of the disclosure;
FIG. 7 is a schematic diagram of uploading user file personalized search recommendation records to a blockchain according to embodiments of the present disclosure;
FIG. 8 is a schematic structural diagram of a search device according to an embodiment of the disclosure;
FIG. 9 is a schematic structural view of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present disclosure and are not to be construed as limiting the present disclosure.
Files stored on a server (e.g., a cloud disk) take various forms, such as text files, picture files, audio files, and video files, and come in different formats. Traditional personalized file search methods mainly include the following:
(1) Searching the file according to the search statement of the user;
(2) Recommending related files according to social network information of users, such as friends, groups and the like;
(3) Preprocessing the file text (e.g., word segmentation), obtaining word vector representations with the classical Word2Vec word vector model, computing a weighted sum of the word vectors to obtain a file vector representation, and performing search and recommendation based on vector similarity.
Scheme (1) is simple and easy to implement, but cannot reflect the interests of users. Scheme (2) requires users to actively maintain their own social network information and tends to limit the file search range, which affects the diversity of file recommendations. In scheme (3), Word2Vec has an insufficient understanding of complex semantic relations and context information and therefore limited precision; it focuses mainly on the single modality of text and easily ignores the deep semantic information of long text files.
In view of the foregoing, the present disclosure proposes a search method, apparatus, electronic device, and storage medium.
The following describes a search method, apparatus, electronic device, and storage medium of the embodiments of the present disclosure with reference to the accompanying drawings.
Fig. 1 is a flowchart of a search method according to an embodiment of the disclosure.
As shown in fig. 1, the search method may include the steps of:
Step 101, in response to receiving a file search request sent by a client, for any candidate file in a plurality of candidate files associated with the client, acquiring a file name and a file title of the any candidate file, and processing text information of the any candidate file according to a file type of the any candidate file to obtain a text slice set.
As one example, files uploaded by the client, files restored by the client, and the like may be regarded as the plurality of candidate files associated with the client.
It may be understood that the plurality of candidate files associated with the client may include a plurality of file types, for example, text, pictures, audio, and the like. Therefore, in order to accurately represent the semantics of each file, the file name and the file title of any candidate file may be obtained, and the text information of the candidate file may be extracted according to its file type. The text information is then segmented according to its length information to obtain the text slice set corresponding to the text information of the candidate file, where the text slice set includes a plurality of text slices.
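The length-based segmentation described above can be sketched as follows. This is a minimal illustration only: the function name `slice_text` and the fixed character budget `max_len` are assumptions, since the disclosure states only that segmentation depends on the length information of the text.

```python
def slice_text(text: str, max_len: int = 200) -> list[str]:
    """Split text into consecutive slices of at most max_len characters.

    max_len is a hypothetical parameter; the disclosure does not fix a
    concrete slice length or splitting rule.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Step through the text in strides of max_len; the last slice may be shorter.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


if __name__ == "__main__":
    for piece in slice_text("abcdefghij", max_len=4):
        print(piece)
```

A production splitter would more likely cut on sentence or paragraph boundaries so that each slice stays semantically coherent.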
It should be noted that, since the plurality of candidate files associated with the client may include a plurality of file types, candidate files of some file types (e.g., pictures) may not include a file title. For a candidate file that includes a file title, the file title may be extracted directly; for a candidate file that does not include a file title, the text information of the candidate file may be obtained first, and the file title may then be extracted from that text information.
Step 102, a first slice vector set is generated according to the text slice set, the file name and the file title.
In order to further improve the accuracy of representing file semantics, each text slice in the text slice set of any candidate file, the file name of the candidate file, and the file title of the candidate file may be vectorized to obtain an initial slice vector set, a file name vector, and a file title vector of the candidate file. The initial slice vector set, the file name vector, and the file title vector are then fused to obtain the first slice vector set of the candidate file.
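One way to picture this fusion is a weighted sum applied to every initial slice vector. The sketch below assumes that form; the function name `fuse_slice_vectors` and the weights `w_slice`, `w_name`, and `w_title` are hypothetical, because the disclosure does not specify the fusion operator.

```python
def fuse_slice_vectors(initial_slices, name_vec, title_vec,
                       w_slice=0.5, w_name=0.25, w_title=0.25):
    """Fuse each initial slice vector with the file name and file title vectors.

    Assumption: fusion is a weighted sum with hypothetical weights. All
    vectors must share the same dimension.
    """
    fused = []
    for s in initial_slices:
        if not (len(s) == len(name_vec) == len(title_vec)):
            raise ValueError("all vectors must have the same dimension")
        fused.append([w_slice * a + w_name * b + w_title * c
                      for a, b, c in zip(s, name_vec, title_vec)])
    return fused


if __name__ == "__main__":
    first_set = fuse_slice_vectors([[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0], [1.0, 1.0])
    print(first_set)
```

Under this assumption every slice vector carries some file-level context, so a short slice can still match a query that mentions the file name or title.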
Step 103, a preference vector of a target object logged in to the client and a feature vector of a search keyword in the search statement in the search request are obtained, and the preference vector and the feature vector are fused by weighting to obtain a session vector.
In order to support personalized search, the preference of the target object logged in to the client and the search keywords may both be taken into account, on the basis of an accurate representation of file semantics.
The preference vector of the target object logged in to the client may be obtained according to the behavior sequence of the target object on the plurality of candidate files within a set period, and may be used to indicate the degree of preference of the target object for each candidate file. The feature vector of the search keyword may be obtained by vectorizing the search keyword in the search statement.
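The weighted fusion in step 103 might look like the sketch below. The convex combination and the weight `alpha` are assumptions for illustration; the disclosure states only that the two vectors are fused by weighting.

```python
def fuse_session_vector(preference_vec, keyword_vec, alpha=0.5):
    """Return alpha * preference + (1 - alpha) * keyword as the session vector.

    alpha is a hypothetical weight in [0, 1]; a larger alpha gives the
    user's historical preference more influence than the current query.
    """
    if len(preference_vec) != len(keyword_vec):
        raise ValueError("vectors must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * p + (1.0 - alpha) * k
            for p, k in zip(preference_vec, keyword_vec)]


if __name__ == "__main__":
    print(fuse_session_vector([1.0, 0.0], [0.0, 1.0], alpha=0.25))
```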
Step 104, determining at least one target file from the candidate files according to the session vector and the first similarity between the slice vectors in the first slice vector set of the candidate files, and sending the target files to the client.
As an example, the first similarity between the session vector and each slice vector in the first slice vector set of each candidate file may be calculated, at least one target file may be determined from the candidate files based on the first similarity corresponding to each slice vector in the first slice vector set of each candidate file, and each target file may be sent to the client.
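A minimal sketch of this direct variant follows. Several choices are assumptions rather than details from the disclosure: cosine similarity as the similarity measure, scoring a file by its best-matching slice, and the cut-off `threshold`.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def select_target_files(session_vec, slice_sets, threshold=0.5):
    """Pick files whose best slice similarity reaches the threshold.

    slice_sets maps a file id to its first slice vector set. The result
    is ordered by best similarity, highest first.
    """
    best = {}
    for file_id, vectors in slice_sets.items():
        sims = [cosine(session_vec, v) for v in vectors]
        if sims and max(sims) >= threshold:
            best[file_id] = max(sims)
    return sorted(best, key=best.get, reverse=True)


if __name__ == "__main__":
    files = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[0.0, 1.0]], "c": [[1.0, 1.0]]}
    print(select_target_files([1.0, 0.0], files))
```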
As another example, to reduce computational complexity, for any candidate file, feature dimension reduction is performed on the plurality of slice vectors in the first slice vector set of the candidate file to obtain a second slice vector set, and feature dimension reduction is performed on the session vector according to the plurality of slice vectors in the first slice vector set of the candidate file to obtain a dimension-reduced session vector. The first similarity corresponding to each slice vector in the first slice vector set of the candidate file is then determined according to a second similarity between each slice vector in the second slice vector set and the dimension-reduced session vector. At least one target file is determined from the candidate files according to these first similarities, and each target file is sent to the client.
In summary, in response to receiving a file search request sent by a client, for any candidate file in a plurality of candidate files associated with the client, the file name and the file title of the candidate file are acquired, and the text information of the candidate file is processed according to its file type to obtain a text slice set; a first slice vector set is generated according to the text slice set, the file name and the file title; a preference vector of the target object logged in to the client and a feature vector of a search keyword in the search statement in the search request are obtained and fused to obtain a session vector; and at least one target file is determined from the candidate files according to the first similarity between the session vector and each slice vector in the first slice vector set of each candidate file, and each target file is sent to the client. In this way, the candidate files are represented as slice-level vectors based on their file types, with the file names and file titles taken into account, and at least one target file is determined from the plurality of candidate files using the session vector obtained by fusing the preference vector of the target object with the feature vector of the search keyword. User preferences and search keywords are thus considered on the basis of an accurate representation of candidate file semantics, so that the files required by the user are accurately recalled from the candidate files, the personalized search requirements of different users are met, and the user's search experience is improved.
To clearly illustrate how in the above embodiments at least one target file is determined from each candidate file according to the session vector and the first similarity between each slice vector in the first set of slice vectors of each candidate file, and each target file is sent to the client, the present disclosure proposes another search method.
Fig. 2 is a flowchart of another search method according to an embodiment of the disclosure.
As shown in fig. 2, the search method may include the steps of:
Step 201, in response to receiving a file search request sent by a client, for any candidate file in a plurality of candidate files associated with the client, acquiring a file name and a file title of the any candidate file, and processing text information of the any candidate file according to a file type of the any candidate file, so as to obtain a text slice set.
Step 202, a first slice vector set is generated according to the text slice set, the file name and the file title.
Step 203, obtaining a preference vector of the target object of the login client and a feature vector of the search keyword in the search statement in the search request, and fusing the preference vector and the feature vector to obtain a session vector.
Step 204, for any candidate file, feature dimension reduction is performed on the plurality of slice vectors in the first slice vector set of the candidate file to obtain a second slice vector set, and feature dimension reduction is performed on the session vector according to the plurality of slice vectors in the first slice vector set of the candidate file to obtain a dimension-reduced session vector.
In order to reduce the computational complexity, feature dimension reduction may be performed on both the plurality of slice vectors in the first slice vector set of any candidate file and the session vector, to obtain the second slice vector set and the dimension-reduced session vector.
As an example, a trained feature dimension reduction model may be used to perform feature dimension reduction on the plurality of slice vectors in the first slice vector set of any candidate file and on the session vector, so as to obtain the second slice vector set of the candidate file and the dimension-reduced session vector.
As another example, the plurality of slice vectors in the first slice vector set of any candidate file are clustered to obtain a plurality of vector clusters corresponding to the candidate file; vectors are sampled from the vector clusters according to the ratio between the number of vectors in each vector cluster and the set sampling number, to obtain a sample vector set of the candidate file; and dimension reduction is performed on the plurality of slice vectors in the first slice vector set of the candidate file according to the sample vector set, to obtain the second slice vector set of the candidate file.
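The per-cluster sampling can be sketched as below, taking the clusters as already computed. The proportional allocation rule (each cluster contributes the rounded-up share of the sampling budget that matches its share of all slice vectors) is an assumption; the disclosure states only that sampling depends on a ratio involving the cluster sizes and the set sampling number, with rounding up. The names `sample_from_clusters` and `seed` are hypothetical.

```python
import math
import random


def sample_from_clusters(clusters, n_sample, seed=0):
    """Randomly draw vectors from each cluster in proportion to its size.

    clusters is a list of clusters, each a list of vectors. Assumed rule:
    cluster k contributes ceil(n_sample * N_cluster_k / N_total) vectors,
    capped at the cluster size. Because of rounding up, the returned
    sample set may be larger than n_sample.
    """
    total = sum(len(c) for c in clusters)
    if total == 0:
        return []
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    samples = []
    for cluster in clusters:
        if not cluster:
            continue
        n_k = min(len(cluster), math.ceil(n_sample * len(cluster) / total))
        samples.extend(rng.sample(cluster, n_k))
    return samples


if __name__ == "__main__":
    clusters = [[[float(i), 0.0] for i in range(6)],
                [[0.0, float(i)] for i in range(3)],
                [[9.0, 9.0]]]
    print(len(sample_from_clusters(clusters, n_sample=4)))
```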
Reducing the dimension of the plurality of slice vectors in the first slice vector set of any candidate file according to the sample vector set of the candidate file, to obtain the second slice vector set of the candidate file, may proceed as follows: determining a sample mean vector of the sample vector set according to the number of sample vectors in the sample vector set of the candidate file; updating each sample vector in the sample vector set according to the sample mean vector to obtain an updated sample vector set; determining the covariance matrix of the updated sample vector set; and performing feature dimension reduction on the plurality of slice vectors in the first slice vector set according to the covariance matrix and the sample mean vector to obtain the second slice vector set.
That is, assume that the total number of clusters after the slice vectors are clustered is $K$, the total number of slice vectors in cluster $k$ is $N_{cluster,k}$, and the number of sample-set slices to be extracted is $N_{sample}$. The number of vectors randomly sampled from cluster $k$ is $N_{sample,k}$, which is obtained by rounding up, where $\lceil\cdot\rceil$ denotes rounding up. The randomly extracted slice vectors form the sample vector set $X$, and the number of samples is updated to $N_{sample} = \sum_{k} N_{sample,k}$. Let the $i$-th vector in $X$ be $x_i$ ($1 \le i \le N_{sample}$); the sample mean vector is then $v_{mean} = \frac{1}{N_{sample}} \sum_{i=1}^{N_{sample}} x_i$. The mean is removed from every vector in $X$, that is, $Q = (x_1 - v_{mean}, x_2 - v_{mean}, \ldots, x_{N_{sample}} - v_{mean})$. The covariance matrix $QQ^T$ is calculated and eigen-decomposed, and the eigenvectors corresponding to the $M$ largest eigenvalues are extracted to construct the eigenvector matrix $P$. Feature dimension reduction is then performed on the slice vectors according to the sample mean vector and the eigenvector matrix, specifically $\tilde{v}_{d,c} = P^T (v_{d,c} - v_{mean})$, where $v_{d,c}$ denotes each slice vector in the first slice vector set and $\tilde{v}_{d,c}$ denotes the second slice vector corresponding to $v_{d,c}$.
At the same time, feature dimension reduction is performed on the session vector according to the sample vector set of any candidate file to obtain the dimension-reduced session vector. As one example, a difference vector between the session vector and the sample mean vector of the candidate file is obtained, and the dimension-reduced session vector is obtained from the product of the difference vector and the eigenvector matrix derived from the covariance matrix of the candidate file. For example, with the session vector denoted as $v_{ue}$, the dimension-reduced session vector for the candidate file is $\tilde{v}_{ue} = P^T (v_{ue} - v_{mean})$, where $P$ denotes the eigenvector matrix derived from the covariance matrix corresponding to the candidate file, and $v_{mean}$ denotes the sample mean vector of the candidate file.
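The mean removal, covariance eigen-decomposition, and projection described above amount to a principal component analysis fitted on the sample vector set. The NumPy sketch below is one possible reading: the function names `fit_pca` and `reduce_vectors` are hypothetical, and the projection orientation (centered row vectors multiplied by a D x M matrix P) is an assumption, since the original projection formula did not survive extraction.

```python
import numpy as np


def fit_pca(samples: np.ndarray, m: int):
    """Fit the sample mean v_mean and eigenvector matrix P on the sample set.

    samples has shape (N_sample, D). Returns v_mean with shape (D,) and
    P with shape (D, M), whose columns are the eigenvectors of Q Q^T
    belonging to the M largest eigenvalues.
    """
    v_mean = samples.mean(axis=0)
    q = (samples - v_mean).T                  # D x N_sample, centered samples as columns
    cov = q @ q.T                             # D x D, unnormalized covariance Q Q^T
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]     # indices of the M largest eigenvalues
    return v_mean, eigvecs[:, order]


def reduce_vectors(vectors: np.ndarray, v_mean: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Project slice vectors or a session vector: (v - v_mean) P, one row per vector."""
    return (np.atleast_2d(vectors) - v_mean) @ p


if __name__ == "__main__":
    samples = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [4.0, 0.0, 0.0],
                        [2.0, 1.0, 0.0], [2.0, -1.0, 0.0]])
    v_mean, p = fit_pca(samples, m=1)
    print(reduce_vectors(samples, v_mean, p).shape)
    print(reduce_vectors(np.array([5.0, 7.0, 9.0]), v_mean, p).shape)
```

The same `v_mean` and `P` are applied to the slice vectors and to the session vector, which keeps both in one reduced space so that the second similarity is meaningful.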
Step 205, determining a first similarity corresponding to each slice vector in the first slice vector set of the candidate file according to the second similarity between each slice vector in the second slice vector set and the session vector after the dimension reduction processing.
Further, a second similarity between each slice vector in the second slice vector set of any candidate file and the session vector after the dimension reduction processing is calculated, and a first similarity corresponding to each slice vector in the first slice vector set of any candidate file is determined according to the second similarity between each slice vector in the second slice vector set of any candidate file and the session vector after the dimension reduction processing. For example, the second similarity between each slice vector in the second set of slice vectors of any candidate file and the session vector after the dimension reduction processing may be used as the first similarity corresponding to each slice vector in the first set of slice vectors of any candidate file. For another example, a product of the second similarity between each slice vector in the second slice vector set of any candidate file and the session vector after the dimension reduction processing and the corresponding set coefficient may be used as the first similarity corresponding to each slice vector in the first slice vector set of any candidate file.
Step 206, determining at least one fourth slice vector from the plurality of slice vectors in the second slice vector set of the candidate file according to the first similarity corresponding to each slice vector in the first slice vector set of the candidate file.
As one example, at least one fourth slice vector having a first similarity greater than a set similarity threshold may be determined from a plurality of slice vectors in a second set of slice vectors of any candidate file.
Step 207, fusing the fourth slice vectors corresponding to the candidate file to obtain the file vector of the candidate file.
Further, at least one fourth slice vector corresponding to any candidate file is fused to perform vectorization expression on the any candidate file, namely, a file vector of the any candidate file is generated.
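Steps 206 and 207 can be sketched together: keep the slice vectors whose first similarity exceeds a threshold, then fuse them into one file vector. Element-wise averaging as the fusion operator is an assumption, and the names `build_file_vector` and `threshold` are hypothetical.

```python
def build_file_vector(slice_vectors, first_similarities, threshold=0.5):
    """Select the fourth slice vectors and fuse them into a file vector.

    Slice vectors whose first similarity exceeds the threshold are kept
    (step 206) and averaged element-wise (step 207, assumed fusion rule).
    Returns None when no slice passes the threshold.
    """
    selected = [v for v, s in zip(slice_vectors, first_similarities) if s > threshold]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(v[i] for v in selected) / len(selected) for i in range(dim)]


if __name__ == "__main__":
    print(build_file_vector([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]], [0.9, 0.7, 0.1]))
```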
Step 208, target files are determined from the candidate files according to a third similarity between the file vector of each candidate file and the session vector, and the target files are sent to the client.
As an example, the popularity value of each candidate file and the characteristic value of the social group to which the target object belongs are obtained, and the third similarity is weighted according to the popularity value of the candidate file and the social group characteristic value to obtain a fourth similarity; the candidate files are sorted according to their fourth similarities to obtain a candidate file sequence; and a target file sequence is determined from the candidate file sequence and sent to the client, where the target file sequence includes at least one target file. For example, the candidate files are sorted in descending order of fourth similarity, and a preset number of files with the highest similarity are returned as the target file sequence and sent to the client.
It should be noted that some servers (e.g., cloud disks) have certain social attributes, such as a circle function or a file sharing function for specific groups. Files from different groups differ in importance to the user, and files from groups with a closer relationship to the user have higher priority. The popularity value of any candidate file may be determined according to the number of times the candidate file has been browsed, collected, shared, downloaded, or commented on, and may be positively correlated with these counts. For example, the more times a candidate file is browsed, collected, shared, or downloaded, the higher the popularity value of the candidate file.
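The re-ranking in step 208 might be sketched as follows. The multiplicative boost formula and the weights `w_pop` and `w_group` are assumptions, as are the dictionary keys; the disclosure says only that the third similarity is weighted by the popularity value and the social group characteristic value.

```python
def rank_candidates(candidates, top_n=10, w_pop=0.2, w_group=0.2):
    """Rank candidate files by an assumed fourth similarity.

    Each candidate is a dict with hypothetical keys: "file_id",
    "third_similarity", "popularity" and "group_value", the last two
    normalized to [0, 1]. Assumed formula:
        fourth = third * (1 + w_pop * popularity + w_group * group_value)
    Returns the ids of the top_n files, highest fourth similarity first.
    """
    scored = []
    for c in candidates:
        boost = 1.0 + w_pop * c["popularity"] + w_group * c["group_value"]
        scored.append((c["third_similarity"] * boost, c["file_id"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [file_id for _, file_id in scored[:top_n]]


if __name__ == "__main__":
    candidates = [
        {"file_id": "a", "third_similarity": 0.8, "popularity": 0.0, "group_value": 0.0},
        {"file_id": "b", "third_similarity": 0.7, "popularity": 1.0, "group_value": 1.0},
        {"file_id": "c", "third_similarity": 0.5, "popularity": 0.5, "group_value": 0.0},
    ]
    print(rank_candidates(candidates, top_n=2))
```

In this example file "b" overtakes file "a" despite a lower third similarity, because it is popular and comes from a closely related group.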
In the embodiment of the disclosure, when the target file sequence is sent to the client, summary information corresponding to each file in the target file sequence may also be sent to the client. As an example, for any target file in the target file sequence: the plurality of fourth slice vectors of the target file are sorted according to the second similarity corresponding to each fourth slice vector to obtain a slice vector sequence corresponding to the target file; the text slice sequence corresponding to the slice vector sequence is acquired from the text slice set corresponding to the target file; the text slices in the text slice sequence are spliced to obtain a spliced text; prompt words are constructed according to the spliced text and the search keywords, and the target file is summarized according to the prompt words to obtain summary information of the target file. The target file sequence and the summary information corresponding to each target file are then sent to the client, so that each target file in the target file sequence is displayed together with its summary.
For example, taking a target file $d$ in the target file sequence as an example, the fourth slice vectors of the target file $d$ are sorted according to the second similarity $\rho_{d,c}$ corresponding to each fourth slice vector to obtain the slice vector sequence $(c_1, c_2, \ldots, c_T)$ corresponding to the target file; the text sequences corresponding to the slice vector sequence $(c_1, c_2, \ldots, c_T)$ are extracted in order according to the file slice segmentation rule and spliced to obtain the spliced text; prompt words are constructed according to the spliced text and the search keywords, for example, keywords are extracted from the spliced text and used together with the search keywords as the prompt words, or every word in the spliced text is used together with the search keywords as the prompt words; then a semantic model is used to retrieve similar content according to the prompt words and to extract a summary.
To make it easier for the user to review operation records, the search recommendation records can be uploaded to a blockchain for evidence storage and provenance tracing, while the original operation records are stored in a centralized database.
In summary, for any candidate file, feature dimension reduction is performed on the plurality of slice vectors in the first slice vector set of the candidate file to obtain a second slice vector set, and feature dimension reduction is performed on the session vector according to the plurality of slice vectors in the first slice vector set of the candidate file to obtain a dimension-reduced session vector; the first similarity corresponding to each slice vector in the first slice vector set of the candidate file is determined according to the second similarity between each slice vector in the second slice vector set and the dimension-reduced session vector; at least one fourth slice vector is determined from the plurality of slice vectors in the second slice vector set of the candidate file according to the first similarity corresponding to each slice vector in the first slice vector set; the fourth slice vectors corresponding to the candidate file are fused to obtain a file vector of the candidate file; and the target files are determined from the candidate files according to the third similarity between the file vector of each candidate file and the session vector, and are sent to the client. Because the similarities are computed on the dimension-reduced second slice vector set and the dimension-reduced session vector, the complexity of determining the target files from the plurality of candidate files is reduced.
In order to clearly explain how to acquire the preference vector of the target object of the login client and the keyword vector of the search keyword in the search sentence in the above-described embodiment, the present disclosure proposes another search method.
Fig. 3 is a flowchart of another search method according to an embodiment of the disclosure.
As shown in fig. 3, the search method may include the steps of:
Step 301, in response to receiving a file search request sent by a client, acquiring a file name and a file title of any candidate file in a plurality of candidate files associated with the client, and processing text information of any candidate file according to the file type of any candidate file to obtain a text slice set.
Step 302, generating a first slice vector set according to the text slice set, the file name and the file title.
Step 303, obtaining a behavior sequence of the target object logged in to the client on the plurality of candidate files within a set period.
For example, the usage behavior sequence $(b_1, b_2, \ldots, b_J)$ of the user on the files stored on the server within the set period is extracted, and the operation time sequence and file index sequence corresponding to the usage behavior sequence are $(t_1, t_2, \ldots, t_J)$ and $(r_1, r_2, \ldots, r_J)$ respectively, where $J$ is the total number of operation behavior records.
Step 304, determining a preference vector of the target object according to the behavior sequence.
As an example, the behavior sequence is weighted according to the social group feature value to which the target object belongs, so as to obtain a preference vector of the target object. For example, it can be expressed as the following formula:
Where $f_1(b_j, t_j)$ represents a time decay weight function of the user behavior and $f_2(b_j)$ represents a user behavior weight function; for example, browsing, collecting, sharing, downloading and commenting on a file express different interest preference weights. Because some servers (e.g., cloud disks) have social properties, such as a circle function or a file sharing function for specific groups, files in different groups differ in importance to the user, and files of more closely related groups generally have higher priority. $f_3(u_{id}, r_j)$ represents the group social feature weight function (social group feature value) of user $u_{id}$ and file $r_j$; if the file exists in multiple groups of the user, the greatest weight is taken. The vector representing the file $r_j$ is calculated as follows:
Step 305, extracting the search keywords from the search sentence, and vectorizing the search keywords to obtain feature vectors of the search keywords.
As an example, the search keywords in the search sentence are extracted, and the extracted search keywords may be vectorized using a semantic model to obtain the feature vectors of the search keywords. The semantic model may be, for example, a Generative Pre-trained Transformer (GPT) model or a Chat General Language Model (ChatGLM).
Step 306, performing weighted fusion on the preference vector and the feature vector to obtain a session vector.
As an example, the preference vector and the feature vector are weighted and fused to obtain a session vector, which can be expressed by the following formula:
$$v_{ue} = w_{be} v_{be} + w_{se} v_{se}$$
where $w_{be}$ represents the user behavioral interest preference weight, $w_{se}$ represents the search keyword weight, $v_{be}$ represents the preference vector, and $v_{se}$ represents the feature vector of the search keyword.
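The weighted fusion of step 306 can be sketched in a few lines of Python. The function name `fuse_session_vector` and the default weight values are illustrative assumptions; the source does not state how the weights are chosen.

```python
def fuse_session_vector(v_be, v_se, w_be=0.3, w_se=0.7):
    """Return the session vector as w_be * v_be + w_se * v_se.

    v_be: preference vector of the target object.
    v_se: feature vector of the search keyword.
    The default weights are placeholders, not values from the source.
    """
    if len(v_be) != len(v_se):
        raise ValueError("preference and keyword vectors must share a dimension")
    return [w_be * b + w_se * s for b, s in zip(v_be, v_se)]
```

Both vectors must come from the same embedding space (the same semantic model), otherwise the element-wise sum has no meaning.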
Step 307, determining at least one target file from the candidate files according to the session vector and the first similarity between the slice vectors in the first slice vector set of the candidate files, and sending the target files to the client.
In summary, the behavior sequence of the target object logged in to the client on the plurality of candidate files within a set period is acquired; the preference vector of the target object is determined according to the behavior sequence; and the search keywords in the search sentence are extracted and vectorized to obtain the feature vectors of the search keywords. In this way, the preference vector of the target object and the feature vectors of the search keywords can both be determined effectively, so that, on the basis of accurately representing the semantics of the candidate files, the user's preferences and search keywords are taken into account, the files required by the user are accurately recalled from the candidate files, the personalized search requirements of different users are met, and the user's search experience is improved.
To clearly illustrate how the text information of any candidate file is processed according to the file type of any candidate file in the above embodiment to obtain a text slice set, the present disclosure proposes another search method.
Fig. 4 is a flowchart of another search method according to an embodiment of the disclosure.
As shown in fig. 4, the search method may include the steps of:
Step 401, in response to receiving a file search request sent by a client, acquiring a file name and a file title of any candidate file from a plurality of candidate files associated with the client.
Step 402, extracting text information of any candidate file according to the file type of any candidate file.
As an example, in response to the file type of any candidate file being a picture, a plurality of image features are extracted from the candidate file, and each of the image features is converted into corresponding text description information; the text information of the candidate file is determined according to the text description information corresponding to each image feature. For example, the text description information corresponding to the respective image features of the candidate file is spliced to obtain the text information of the candidate file.
As another example, in response to the file type of any candidate file being audio, the audio data in the candidate file is extracted and converted into corresponding text description information, and the text information of the candidate file is determined according to that text description information. For example, the text description information corresponding to the audio data in the candidate file is used as the text information of the candidate file.
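The type-dependent extraction of step 402 can be sketched as a small dispatcher. The extension sets and the three callables (`read_text`, `image_to_text`, `audio_to_text`) are hypothetical stand-ins for whatever text reader, image-to-text model and audio-to-text model a deployment uses; none of them is named in the source.

```python
from pathlib import Path

# Hypothetical extension sets used only for this illustration.
TEXT_EXT = {".txt", ".md"}
IMAGE_EXT = {".png", ".jpg", ".jpeg"}
AUDIO_EXT = {".mp3", ".wav"}


def extract_text(path, read_text, image_to_text, audio_to_text):
    """Return the text information of a file, or None for unsupported types.

    read_text(path) -> str
    image_to_text(path) -> iterable of str, one description per image feature
    audio_to_text(path) -> str
    """
    suffix = Path(path).suffix.lower()
    if suffix in TEXT_EXT:
        return read_text(path)
    if suffix in IMAGE_EXT:
        # Splice the per-feature descriptions into one text, as in the example.
        return " ".join(image_to_text(path))
    if suffix in AUDIO_EXT:
        return audio_to_text(path)
    return None  # treated as an invalid file
```

Passing the converters in as parameters keeps the sketch independent of any particular model or API.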
It should be noted that, in order to satisfy personalized permission configuration, specific files may be designated for searching when file search recommendation is performed, for example, files belonging to a certain group, files in a specific format, or files within the search permission; specific files may likewise be excluded from the search.
As one example, a target file album matching the search statement in the file search request is determined from among a plurality of stored file albums, and the files in the target file album are taken as the plurality of candidate files associated with the client.
As another example, the file format of the first files to be searched is extracted from the search sentence, and a plurality of first files to be searched that match the file format are obtained from the stored files; the plurality of first files to be searched are taken as the plurality of candidate files associated with the client.
As yet another example, the search permission matching the target object is obtained, and a plurality of second files to be searched that match the search permission are obtained from the stored files; the plurality of second files to be searched are taken as the plurality of candidate files associated with the client.
Step 403, performing text segmentation on the text information of any candidate file according to the length information of the text information and the set slice length information, so as to obtain a text slice set of the candidate file.
Wherein the text slice set comprises a plurality of text slices.
In order to keep the semantic information between adjacent slices smooth, text information may be repeated between adjacent slices, that is, the tail of the previous slice overlaps the head of the next slice. As an example, let the length of the text information of any candidate file be $L_d$ and the set slice length be $L_{max}$. When $L_d \le L_{max}$, no segmentation is performed, and the text slice set corresponding to the text information contains only 1 text slice; otherwise, segmentation is performed with a smoothly moving window in which adjacent slices overlap, and the number of text slices in the text slice set is obtained by rounding up, where $\lceil \cdot \rceil$ denotes rounding up.
As another example, the text information of any candidate file may be text-segmented according to paragraphs to obtain a text slice set of any candidate file, for example, the text information of any candidate file includes 5 text paragraphs, each paragraph is a text slice, and the text slice set of any candidate file may include 5 text slices.
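A minimal Python sketch of the overlapping segmentation is given below. The slice-count expression is an assumption: it is the count that a sliding window of length `max_len` with overlap `overlap` (the quantity called $L_{cp}$ later in this description) needs to cover the text, and may differ in form from the formula in the original filing. The function name `slice_text` is likewise illustrative.

```python
import math


def slice_text(text, max_len, overlap):
    """Split text into slices of at most max_len characters.

    Adjacent slices share `overlap` characters (tail of one, head of the next).
    Texts no longer than max_len are returned as a single slice.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(text) <= max_len:
        return [text]
    stride = max_len - overlap
    # Assumed count: ceil((L_d - L_cp) / (L_max - L_cp)).
    n_slices = math.ceil((len(text) - overlap) / stride)
    return [text[i * stride : i * stride + max_len] for i in range(n_slices)]
```

For instance, `slice_text("abcdefghij", 4, 1)` yields `["abcd", "defg", "ghij"]`, where each slice begins with the last character of the previous one.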
Step 404, generating a first slice vector set according to the text slice set, the file name and the file title.
Step 405, obtaining the preference vector of the target object logged in to the client and the feature vector of the search keyword in the search statement in the search request, and fusing the preference vector with the feature vector to obtain the session vector.
Step 406, determining at least one target file from the candidate files according to the session vector and the first similarity between the slice vectors in the first slice vector set of the candidate files, and sending the target files to the client.
In summary, the text information of any candidate file is extracted according to the file type of the candidate file, and the text information is segmented according to its length information and the set slice length information to obtain a text slice set of the candidate file, wherein the text slice set comprises a plurality of text slices. The candidate files are thus vectorized at slice level based on their file types and with their file names and file titles taken into account, and the target files are determined in combination with the session vector obtained by fusing the preference vector of the target object with the feature vectors of the search keywords in the search sentence. As a result, on the basis of accurately representing the semantics of the candidate files, the user's preferences and search keywords are taken into account, the files required by the user are accurately recalled from the candidate files, the personalized search requirements of different users are met, and the user's search experience is improved.
On the basis of any embodiment of the present disclosure, taking the case where the server is a cloud disk as an example, an implementation flow of the present disclosure may be as shown in fig. 5, and mainly includes the following steps:
1. File slice corpus (text slice set) generation, which may be as shown in fig. 6 and may specifically include the following steps:
(1) Text information extraction: taking a file $d$ ($1 \le d \le D$) as an example, the file type is judged first. If the file is a text file, the text information is extracted directly; if the file is a picture file, text information such as scene tags is acquired by means of an image-to-text semantic analysis model; if the file is an audio file, the text information is acquired by means of an audio-to-text semantic analysis model; otherwise, the file is treated as an invalid file. It should be noted that mature large language model technologies and tools such as API interfaces already exist for image-to-text and audio-to-text conversion, for example GPT-4, DALL-E and Whisper, and they are not described here.
(2) Text information preprocessing: cleaning the text information obtained in step (1), including preprocessing operations such as removing stop words;
(3) Text slice generation: to keep the semantic information between adjacent slices smooth, text information is repeated between adjacent slices, that is, the tail of the previous slice overlaps the head of the next slice. Assuming that the length of the text information obtained in step (2) is $L_d$, the maximum slice length is $L_{max}$, and the length of the repeated text between adjacent slices is $L_{cp}$, segmentation follows this principle: when the text information length is not greater than the maximum slice length, that is, $L_d \le L_{max}$, no segmentation is performed, and the file has only $N_d = 1$ slice, of length $L_d$; otherwise, the file is segmented with a smoothly moving window in which adjacent slices overlap, and the number of slices $N_d$ is obtained by rounding up, where $\lceil \cdot \rceil$ denotes rounding up.
2. File slice vectorization
(1) Name and title vectorization: extracting the name and the title of a file $d$ ($1 \le d \le D$), and vectorizing them with a semantic analysis model to obtain the corresponding vectors $v_{file,d}$ and $v_{title,d}$;
(2) Slice content vectorization: for the slice corpus (text slice set) $s_{d,c}$ ($1 \le c \le N_d$) obtained by file segmentation, inputting the slices into the semantic analysis model in turn for vectorization to obtain the corresponding vectors $v_{chip,d,c}$;
(3) Slice vector weighted fusion: let the weights corresponding to the file name, the title and the slice be $w_{file,d}$, $w_{title,d}$ and $w_{chip,d}$ respectively; the slice vector after weighted fusion is then:
$$v_{d,c} = w_{file,d}\, v_{file,d} + w_{title,d}\, v_{title,d} + w_{chip,d}\, v_{chip,d,c}$$
(4) Slice vector storage: storing the slice vectors obtained in step (3) and their corresponding corpus content in a vector database, to facilitate data persistence and subsequent vector retrieval.
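Step (3) can be illustrated with a short Python sketch that fuses the file name vector and the title vector into every slice content vector of a file. The function name `fuse_slice_vectors` and the default weights are assumptions; the source names the weights but gives no values.

```python
def fuse_slice_vectors(v_file, v_title, v_chips,
                       w_file=0.2, w_title=0.2, w_chip=0.6):
    """Return one fused vector per slice of a file.

    v_file, v_title: name and title vectors of the file.
    v_chips: list of slice content vectors of the same file.
    Each result is w_file*v_file + w_title*v_title + w_chip*v_chip.
    The default weights are placeholders, not values from the source.
    """
    return [
        [w_file * f + w_title * t + w_chip * c
         for f, t, c in zip(v_file, v_title, chip)]
        for chip in v_chips
    ]
```

Because the name and title vectors are shared by all slices of a file, every fused slice vector carries file-level context in addition to its own content.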
3. File slice feature dimension reduction
(1) Slice vector clustering: clustering the file slice vectors $v_{d,c}$ ($1 \le d \le D$, $1 \le c \le N_d$) according to a preset rule, for example with the K-means algorithm;
(2) Forming a sample set by random sampling: assuming that the total number of clusters after clustering is $K$, the total number of slice vectors in cluster $k$ is $N_{cluster,k}$, and the number of sample set slices to be extracted is $N_{sample}$, the number $N_{sample,k}$ of vectors randomly sampled from cluster $k$ is determined from $N_{cluster,k}$ and $N_{sample}$ with rounding up, where $\lceil \cdot \rceil$ denotes rounding up. The randomly extracted slices form the sample vector set $X$, and the number of samples is updated to $N_{sample} = \sum_k N_{sample,k}$;
(3) Let the $i$-th vector in the sample vector set $X$ be $x_i$ ($1 \le i \le N_{sample}$); the sample mean vector is then
$$v_{mean} = \frac{1}{N_{sample}} \sum_{i=1}^{N_{sample}} x_i$$
(4) Centering all vectors in the sample vector set $X$, that is, $Q = (x_1 - v_{mean}, x_2 - v_{mean}, \ldots, x_{N_{sample}} - v_{mean})$, calculating the covariance matrix $QQ^T$, performing eigenvalue decomposition on it, and extracting the eigenvectors corresponding to the $M$ largest eigenvalues to construct the eigenvector matrix $P$;
(5) According to the sample mean vector and the eigenvector matrix, the feature dimension reduction of a slice vector can be expressed as follows:
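Steps (2) to (5) can be sketched with NumPy as below. Two points are assumptions rather than statements of the source: the per-cluster sample count is taken as a proportional allocation rounded up, and the reduced vector is taken as $P^T (v - v_{mean})$, i.e. the usual principal component projection. Cluster labels are assumed to come from a prior clustering step such as K-means; the function names are illustrative.

```python
import math
import random

import numpy as np


def stratified_sample(vectors, labels, n_sample, seed=0):
    """Randomly draw vectors from each cluster, roughly in proportion to its size."""
    rng = random.Random(seed)
    clusters = {}
    for idx, k in enumerate(labels):
        clusters.setdefault(k, []).append(idx)
    picked = []
    for members in clusters.values():
        # Assumed allocation: ceil(N_sample * N_cluster_k / N_total), capped at cluster size.
        n_k = min(len(members), math.ceil(n_sample * len(members) / len(vectors)))
        picked.extend(rng.sample(members, n_k))
    return np.asarray([vectors[i] for i in picked], dtype=float)


def fit_reducer(samples, m):
    """Return (v_mean, P), where P holds the eigenvectors of the m largest eigenvalues."""
    v_mean = samples.mean(axis=0)
    q = (samples - v_mean).T           # columns are the centered sample vectors
    cov = q @ q.T                      # Q Q^T as in step (4)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:m]
    return v_mean, eigvecs[:, top]


def reduce_vector(v, v_mean, p):
    """Project a slice or session vector: P^T (v - v_mean) (assumed form)."""
    return p.T @ (np.asarray(v, dtype=float) - v_mean)
```

The same `v_mean` and `P` would be applied to the session vector, so that slice vectors and the session vector are compared in the same reduced space.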
4. User search session vectorized expression
(1) Extracting the user's historical cloud disk behaviors: extracting the user's cloud disk file usage behavior sequence $(b_1, b_2, \ldots, b_J)$ within a set time period, where the corresponding operation time sequence and file index sequence are $(t_1, t_2, \ldots, t_J)$ and $(r_1, r_2, \ldots, r_J)$ respectively, and $J$ is the total number of operation behavior records;
(2) User interest preference vector expression: according to the user usage behaviors in step (1), the corresponding user interest preference vector is calculated by weighting:
Where $f_1(b_j, t_j)$ represents a time decay weight function of the user behavior and $f_2(b_j)$ represents a user behavior weight function; for example, browsing, collecting, sharing, downloading and commenting on a file express different interest preference weights. Because some servers (e.g., cloud disks) have social properties, such as a circle function or a file sharing function for specific groups, files in different groups differ in importance to the user, and files of more closely related groups generally have higher priority. $f_3(u_{id}, r_j)$ represents the group social feature weight function (social group feature value) of user $u_{id}$ and file $r_j$; if the file exists in multiple groups of the user, the greatest weight is taken. The vector representing the file $r_j$ is calculated as follows:
(3) Search keyword vectorization: vectorizing the user's search keywords with the same latent semantic analysis model as in the file slice vectorization method to obtain the search keyword vector (feature vector) $v_{se}$;
(4) User search session vectorization: weighting the user's historical behavioral interest preference vector and the search keyword vector (feature vector), that is,
$$v_{ue} = w_{be} v_{be} + w_{se} v_{se}$$
Where $w_{be}$ represents the user behavioral interest preference weight and $w_{se}$ represents the search keyword weight.
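The preference vector of step (2) can be sketched as follows. The combination rule is an assumption: each behavior contributes the vector of the file it touched, weighted by the product of a time decay term ($f_1$), a behavior weight ($f_2$) and the largest group weight of the file ($f_3$), and the sum is normalized by the total weight. The exponential half-life decay, the behavior weight table and all names are hypothetical.

```python
# Hypothetical behavior weights standing in for f_2; values are illustrative.
BEHAVIOR_WEIGHTS = {"browse": 1.0, "collect": 2.0, "share": 3.0,
                    "download": 2.0, "comment": 1.5}


def time_decay(t_j, now, half_life=7 * 24 * 3600.0):
    """Stand-in for f_1: the weight halves every `half_life` seconds (assumed form)."""
    return 0.5 ** ((now - t_j) / half_life)


def preference_vector(behaviors, times, file_ids, file_vectors, group_weights, now):
    """Weighted, normalized sum of the vectors of the files the user acted on.

    file_vectors: file id -> vector of that file.
    group_weights: file id -> list of weights, one per user group containing
    the file; the greatest weight is used (stand-in for f_3). Files without
    an entry get weight 1.0, which is an assumption.
    """
    dim = len(next(iter(file_vectors.values())))
    acc = [0.0] * dim
    total = 0.0
    for b, t, r in zip(behaviors, times, file_ids):
        w = time_decay(t, now) * BEHAVIOR_WEIGHTS[b] * max(group_weights.get(r) or [1.0])
        total += w
        acc = [a + w * x for a, x in zip(acc, file_vectors[r])]
    return [a / total for a in acc] if total else acc
```

The resulting vector plays the role of $v_{be}$ in the session vector formula of step (4).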
5. File slice similarity retrieval
(1) File slice similarity calculation: calculating the similarity $\rho_{d,c}$ between the feature-dimension-reduced slice vector and the user search session vector;
(2) File slice screening: presetting a similarity threshold $\rho_{th}$, and selecting the file slices whose similarity $\rho_{d,c}$ meets the threshold condition.
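The slice similarity retrieval can be sketched as below. The source does not state which similarity measure is used or the exact threshold condition, so cosine similarity and the condition $\rho_{d,c} \ge \rho_{th}$ are assumptions, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def screen_slices(slice_vectors, session_vector, threshold):
    """Return (slice index, similarity) pairs whose similarity reaches the threshold."""
    scored = (
        (c, cosine_similarity(v, session_vector))
        for c, v in enumerate(slice_vectors)
    )
    return [(c, rho) for c, rho in scored if rho >= threshold]
```

Both the slice vectors and the session vector are expected to have been reduced with the same mean vector and eigenvector matrix before this comparison.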
6. Document similarity retrieval
(1) File vector expression: according to the preliminary retrieval result of the file slices, the selected slice vectors are fused by weighting to form the vectorized representation of the file;
(2) File similarity calculation: the similarity between the file vector in step (1) and the user search session vector is calculated for each file in turn.
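A minimal sketch of the file-level step follows. The fusion weights are not given in the source; here each selected slice is weighted by its own similarity, normalized to sum to one, and the file similarity is again taken as cosine similarity. Both choices and the function names are assumptions.

```python
import math


def file_vector(selected):
    """Fuse the selected slices of one file into a single file vector.

    selected: non-empty list of (slice_vector, similarity) pairs that passed
    screening. Weights are the similarities normalized to sum to one (assumed).
    """
    total = sum(rho for _, rho in selected)
    if total <= 0:
        raise ValueError("at least one slice with positive similarity is required")
    dim = len(selected[0][0])
    return [sum(rho * v[i] for v, rho in selected) / total for i in range(dim)]


def file_similarity(v_file, v_session):
    """Cosine similarity between a file vector and the session vector (assumed measure)."""
    dot = sum(x * y for x, y in zip(v_file, v_session))
    norm = math.sqrt(sum(x * x for x in v_file)) * math.sqrt(sum(y * y for y in v_session))
    return dot / norm if norm else 0.0
```

Files for which no slice passed screening have no file vector and would simply drop out of the candidate list under this sketch.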
7. File search recommendation list ranking
(1) File weighted similarity calculation: the file similarity is weighted in consideration of file popularity and social group features to obtain the weighted similarity;
(2) Returning a file search list: the file retrieval results are sorted in descending order of weighted similarity, and a preset number of files with the highest similarity is returned as the recommended file list of the current search session.
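The ranking step can be sketched as below. The weighting rule is an assumption: the file similarity is multiplied by a popularity factor and a social group factor, which is only one way of "weighting" it. The dictionary keys and the function name are illustrative.

```python
def rank_files(candidates, top_n):
    """Return the top_n (file_id, weighted_similarity) pairs in descending order.

    candidates: list of dicts with keys "file_id", "similarity",
    "popularity" and "group_weight". The multiplicative combination
    is an assumed weighting, not the formula of the source.
    """
    scored = [
        (c["file_id"], c["similarity"] * c["popularity"] * c["group_weight"])
        for c in candidates
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

With this rule a slightly less similar file can outrank a more similar one if it is considerably more popular or belongs to a group closer to the user, which matches the intent described above.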
As a possible implementation manner of the embodiment of the present disclosure, when the personalized recommended file list for a cloud disk search is returned, prompt words may be constructed from the corpus of high-similarity slice content, a summary may be generated by means of a latent semantic model (such as GPT or ChatGLM), and the summary content may be presented so that the user can check and confirm the search result. The method is as follows:
(1) Slice index extraction: taking a file $d$ in the personalized recommendation list as an example, the slice retrieval result corresponding to the file in the file slice similarity retrieval method is extracted and sorted according to the similarity $\rho_{d,c}$ to obtain the slice index (slice vector sequence) $(c_1, c_2, \ldots, c_T)$;
(2) Corpus extraction: the text sequences corresponding to the slice indexes in step (1) are extracted in order according to the file slice segmentation rule, and the slices are separated by separators;
(3) Semantic model summarization: prompt words are constructed according to the corpus and the search keywords, similar content is retrieved based on the latent semantic model, and a summary is extracted.
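Steps (1) to (3) up to the model call can be sketched as a prompt builder. The prompt wording, the separator string, the descending sort order and the function name `build_prompt` are assumptions; the call to the semantic model itself is left out because the source does not specify an interface.

```python
def build_prompt(slices, similarities, keywords, separator="\n---\n"):
    """Build summarization prompt words from the retrieved slices of one file.

    slices: text slices of the file that passed screening.
    similarities: similarity of each slice, in the same order as `slices`.
    keywords: search keywords of the current session.
    Slices are ordered by descending similarity (assumed order) and joined
    with a separator.
    """
    order = sorted(range(len(slices)), key=lambda c: similarities[c], reverse=True)
    corpus = separator.join(slices[c] for c in order)
    return (
        "Summarize the file excerpts below with respect to the search keywords.\n"
        f"Keywords: {', '.join(keywords)}\n"
        f"Excerpts:\n{corpus}"
    )
```

The returned string would then be passed to whichever summarization model the deployment uses, and the model output shown next to the file in the result list.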
As a possible implementation manner of the embodiments of the present disclosure, when personalized cloud disk file search and recommendation is performed, specific files may be designated for searching, for example, files belonging to a certain group or files in a specific format, and specific files may also be excluded from the search, so as to satisfy personalized permission configuration.
As a possible implementation manner of the embodiments of the present disclosure, in the method for generating a document slice corpus, segmentation may be performed according to a paragraph structure, and there is no need to keep the lengths of the slice text sequences as equal as possible.
To facilitate business operations such as user review of operation records, auditing and log viewing, as a possible implementation manner of the embodiment of the disclosure, as shown in fig. 7, the cloud disk may upload the user's personalized file search recommendation records to a blockchain for evidence storage and provenance tracing, while the original operation records are stored in a centralized database.
In order to implement the embodiments shown in fig. 1 to 7 described above, the present disclosure proposes a search apparatus.
Fig. 8 is a schematic structural diagram of a search device according to an embodiment of the disclosure.
As shown in fig. 8, the search apparatus 800 includes: a first processing module 810, a generating module 820, a fusing module 830, and a determining module 840.
The first processing module 810 is configured to, in response to receiving a file search request sent by a client, obtain, for any candidate file of a plurality of candidate files associated with the client, a file name and a file title of the any candidate file, and process text information of the any candidate file according to a file type of the any candidate file, so as to obtain a text slice set; a generating module 820, configured to generate a first set of slice vectors according to the set of text slices, the file name and the file title; the fusion module 830 is configured to obtain a preference vector of a target object of the login client and a feature vector of a search keyword in a search statement in the search request, and fuse the preference vector with the feature vector to obtain a session vector; the determining module 840 is configured to determine at least one target file from the candidate files according to the session vector and the first similarity between the slice vectors in the first set of slice vectors of the candidate files, and send the target files to the client.
As a possible implementation manner of the embodiments of the present disclosure, the determining module 840 is configured to, for any candidate file, perform feature dimension reduction on the plurality of slice vectors in the first slice vector set of the candidate file to obtain a second slice vector set, and perform feature dimension reduction on the session vector according to the plurality of slice vectors in the first slice vector set of the candidate file to obtain a dimension-reduced session vector; determine the first similarity corresponding to each slice vector in the first slice vector set of the candidate file according to the second similarity between each slice vector in the second slice vector set and the dimension-reduced session vector; determine at least one fourth slice vector from the plurality of slice vectors in the second slice vector set of the candidate file according to the first similarity corresponding to each slice vector in the first slice vector set of the candidate file; fuse the fourth slice vectors corresponding to the candidate file to obtain a file vector of the candidate file; and determine the target files from the candidate files according to the third similarity between the file vector of each candidate file and the session vector, and send the target files to the client.
As one possible implementation manner of the embodiments of the present disclosure, the determining module 840 is further configured to, for any candidate file, cluster the plurality of slice vectors in the first slice vector set of the candidate file to obtain a plurality of vector clusters; extract a set number of sample vectors from each vector cluster according to the ratio of the number of slice vectors in the vector cluster to the set sampling number, so as to generate a sample vector set; and perform feature dimension reduction on the plurality of slice vectors in the first slice vector set of the candidate file according to the sample vector set to obtain the second slice vector set.
As one possible implementation manner of the embodiments of the present disclosure, the determining module 840 is further configured to determine a sample mean vector of the sample vector set according to the number of sample vectors in the sample vector set; update each sample vector in the sample vector set according to the sample mean vector to obtain an updated sample vector set; determine a covariance matrix of the updated sample vector set; and perform feature dimension reduction on the plurality of slice vectors in the first slice vector set of the candidate file according to the covariance matrix and the sample mean vector to obtain the second slice vector set.
As one possible implementation manner of the embodiments of the present disclosure, the determining module 840 is further configured to obtain a difference vector between the session vector and a sample mean vector of the candidate file; and obtaining the conversation vector after the dimension reduction processing according to the product of the difference vector of the candidate file and the covariance matrix.
As a possible implementation manner of the embodiment of the present disclosure, the determining module 840 is further configured to obtain a popularity value of the candidate file and a social group feature value to which the target object belongs, and weight the third similarity according to the popularity value of the candidate file and the social group feature value to which the target object belongs, so as to obtain a fourth similarity; sorting the candidate files according to the fourth similarity of the candidate files to obtain a candidate file sequence; and determining a target file sequence from the candidate file sequences, and sending the target file sequence to the client, wherein the target file sequence comprises at least one target file.
As a possible implementation manner of the embodiments of the present disclosure, the determining module 840 is further configured to sort the fourth slice vectors of any target file in the target file sequence according to the second similarity corresponding to each fourth slice vector, so as to obtain a slice vector sequence; acquire the text slice sequence corresponding to the slice vector sequence from the text slice set; splice the text slices in the text slice sequence to obtain a spliced text; construct prompt words according to the spliced text and the search keywords, and summarize the target file according to the prompt words to obtain summary information of the target file; and send the target file sequence and the summary information corresponding to each target file in the target file sequence to the client, so that each target file in the target file sequence is displayed together with its summary.
As one possible implementation of the embodiments of the present disclosure, the search apparatus 800 further includes a sending module.
The sending module is used for sending the target file sequence and the summary information corresponding to each file in the target file sequence to the blockchain so as to store the target file sequence and the summary information corresponding to each file in the target file sequence.
As one possible implementation of the embodiments of the present disclosure, the generating module 820 is configured to: vectorize each text slice in the text slice set, the file name, and the file title, to obtain an initial slice vector set, a file name vector, and a file title vector; and fuse the initial slice vector set, the file name vector, and the file title vector to obtain the first slice vector set.
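One way to picture the fusion step is a weighted blend of each slice vector with the file name and file title vectors. The source does not say how the vectors are fused, so the weighted sum and the default weights below are assumptions, and `fuse_slice_vectors` is a hypothetical name.

```python
from typing import List

Vector = List[float]


def fuse_slice_vectors(initial: List[Vector], name_vec: Vector, title_vec: Vector,
                       w_slice: float = 0.5, w_name: float = 0.25,
                       w_title: float = 0.25) -> List[Vector]:
    """Blend each initial slice vector with the file name and title vectors."""
    return [
        [w_slice * s + w_name * n + w_title * t
         for s, n, t in zip(vec, name_vec, title_vec)]
        for vec in initial
    ]


first_slice_vectors = fuse_slice_vectors(
    initial=[[1.0, 0.0], [0.0, 1.0]],
    name_vec=[2.0, 2.0],
    title_vec=[4.0, 0.0],
)
```

Other fusion choices, such as concatenation, would satisfy the same description.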
As one possible implementation of the embodiments of the present disclosure, the fusion module 830 is configured to: obtain a behavior sequence, within a set period, of the target object logged in to the client with respect to the candidate files; determine the preference vector of the target object according to the behavior sequence; and extract the search keyword from the search statement and vectorize the search keyword to obtain the feature vector of the search keyword.
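A minimal sketch of deriving a preference vector from a behavior sequence and fusing it with the keyword feature vector follows. The behavior types, their weights in `ACTION_WEIGHTS`, the weighted-average aggregation, and the blend factor `alpha` are all hypothetical; the source specifies none of them.

```python
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical weights per behavior type; the source does not specify any.
ACTION_WEIGHTS: Dict[str, float] = {"view": 1.0, "download": 2.0, "share": 3.0}


def preference_vector(behaviors: List[Tuple[str, Vector]]) -> Vector:
    """Weighted average of the vectors of files the user acted on in the period."""
    total = sum(ACTION_WEIGHTS[action] for action, _ in behaviors)
    dim = len(behaviors[0][1])
    return [
        sum(ACTION_WEIGHTS[action] * vec[i] for action, vec in behaviors) / total
        for i in range(dim)
    ]


def session_vector(preference: Vector, keyword_vec: Vector,
                   alpha: float = 0.25) -> Vector:
    """Fuse preference and keyword vectors with an assumed convex blend."""
    return [alpha * p + (1.0 - alpha) * k for p, k in zip(preference, keyword_vec)]


pref = preference_vector([("view", [1.0, 0.0]), ("share", [0.0, 1.0])])
session = session_vector(pref, keyword_vec=[1.0, 1.0])
```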
As one possible implementation of the embodiments of the present disclosure, the first processing module 810 is configured to: extract the text information of any candidate file according to the file type of that candidate file; and segment the text information of the candidate file according to the length of the text information and a set slice length, to obtain the text slice set of the candidate file, wherein the text slice set includes a plurality of text slices.
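The segmentation step can be illustrated with a fixed-length splitter. This is a sketch under the assumption that slices are consecutive, non-overlapping character windows; the source only says that the split depends on the text length and a set slice length, and `slice_text` is a hypothetical name.

```python
from typing import List


def slice_text(text: str, slice_length: int) -> List[str]:
    """Split text into consecutive slices of at most slice_length characters."""
    if slice_length <= 0:
        raise ValueError("slice_length must be positive")
    if not text:
        return []
    if len(text) <= slice_length:
        # Text shorter than the set slice length stays in one slice.
        return [text]
    return [text[i:i + slice_length] for i in range(0, len(text), slice_length)]


slices = slice_text("abcdefghij", slice_length=4)
```

A production splitter would more likely cut on sentence or paragraph boundaries, possibly with overlap, so that a slice does not end mid-word.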
As one possible implementation of the embodiments of the present disclosure, the first processing module 810 is further configured to: in response to the file type of any candidate file being a picture, extract a plurality of image features from the candidate file and convert each of the image features into corresponding text description information; and determine the text information of the candidate file according to the text description information corresponding to each image feature of the candidate file.
As one possible implementation of the embodiment of the present disclosure, the search apparatus 800 further includes: an extraction module and a conversion module.
The extraction module is used for responding to the file type of any candidate file as audio, and extracting audio data in any candidate file; the conversion module is used for converting the audio data in any candidate file into corresponding text description information; the determining module 840 is configured to determine text information of any candidate file according to text description information corresponding to the audio data in any candidate file.
As one possible implementation of the embodiments of the present disclosure, the search apparatus 800 further includes a second processing module.
The second processing module is used for determining a target file album matched with the search statement from a plurality of stored file albums according to the search statement in the file search request; and taking the files in the target file album as a plurality of candidate files associated with the client.
As one possible implementation of the embodiments of the present disclosure, the search apparatus 800 further includes a third processing module.
The third processing module is used for extracting the file format of a first file to be searched in the search statement, and acquiring a plurality of first files to be searched matched with the file format from a plurality of stored files; and taking the plurality of first files to be searched as a plurality of candidate files associated with the client.
As one possible implementation of the embodiments of the present disclosure, the search apparatus 800 further includes a fourth processing module.
The fourth processing module is used for acquiring the searching authority matched with the target object and acquiring a plurality of second files to be searched matched with the searching authority from the stored files; and taking the plurality of second files to be searched as a plurality of candidate files associated with the client.
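The candidate pre-selection performed by the third and fourth processing modules can be sketched as a simple filter. The `StoredFile` record, its `required_permission` label, and the choice to combine the format filter and the permission filter in one function are illustrative assumptions; the source presents them as separate possible implementations.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class StoredFile:
    name: str
    file_format: str          # e.g. "pdf", "docx"
    required_permission: str  # hypothetical permission label


def select_candidates(files: List[StoredFile], user_permissions: Set[str],
                      wanted_format: Optional[str] = None) -> List[StoredFile]:
    """Keep files the user may search, optionally restricted to one file format."""
    return [
        f for f in files
        if f.required_permission in user_permissions
        and (wanted_format is None or f.file_format == wanted_format)
    ]


stored = [
    StoredFile("plan.pdf", "pdf", "staff"),
    StoredFile("salary.pdf", "pdf", "hr"),
    StoredFile("notes.docx", "docx", "staff"),
]
candidates = select_candidates(stored, user_permissions={"staff"}, wanted_format="pdf")
```

Filtering before vectorization keeps the slice-level similarity computation limited to files that could actually be returned.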
With the search apparatus of the embodiments of the present disclosure, in response to receiving a file search request sent by a client, for any candidate file among a plurality of candidate files associated with the client, a file name and a file title of the candidate file are obtained, and text information of the candidate file is processed according to the file type of the candidate file to obtain a text slice set; a first slice vector set is generated according to the text slice set, the file name, and the file title; a preference vector of a target object logged in to the client and a feature vector of a search keyword in a search statement in the search request are obtained, and the preference vector and the feature vector are fused to obtain a session vector; and at least one target file is determined from the candidate files according to a first similarity between the session vector and the slice vectors in the first slice vector set of each candidate file, and each target file is sent to the client. In this way, the candidate files are vectorized at the slice level based on their file types, with their file names and file titles taken into account, and the target files are determined using a session vector that combines the preference vector of the target object with the feature vector of the search keyword. The semantics of the candidate files are thus characterized accurately while both the user's preferences and the search keyword are considered, so that the files the user needs are accurately recalled from the candidate files, the personalized search requirements of different users are met, and the user's search experience is improved.
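The slice-level matching summarized above can be sketched end to end as follows. Cosine similarity, mean pooling of the best slices into a file vector, and the `top_n` parameter are assumptions chosen for illustration; the source speaks only of "similarity" and "fusing" without fixing either, and the dimension reduction step is omitted here for brevity.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def file_vector(slice_vectors: List[Vector], session: Vector, top_n: int) -> Vector:
    """Average the top_n slice vectors most similar to the session vector."""
    ranked = sorted(slice_vectors, key=lambda v: cosine(v, session), reverse=True)
    best = ranked[:top_n]
    return [sum(col) / len(best) for col in zip(*best)]


def pick_target(files: Dict[str, List[Vector]], session: Vector,
                top_n: int = 2) -> str:
    """Return the file whose fused file vector is most similar to the session."""
    return max(
        files,
        key=lambda fid: cosine(file_vector(files[fid], session, top_n), session),
    )


files = {
    "report.pdf": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    "notes.txt": [[0.0, 1.0], [0.1, 0.9]],
}
target = pick_target(files, session=[1.0, 0.0])
```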
It should be noted that the foregoing explanation of the searching method embodiment is also applicable to the searching apparatus of this embodiment, and will not be repeated here.
In an exemplary embodiment, an electronic device is also presented.
The electronic device includes:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the search method set forth in any of the foregoing embodiments.
As an example, FIG. 9 is a schematic structural diagram of an electronic device 900 according to an exemplary embodiment of the disclosure. As shown in FIG. 9, the electronic device 900 may include:
a memory 910, a processor 920, and a bus 930 connecting the different components (including the memory 910 and the processor 920), the memory 910 storing a computer program that, when executed by the processor 920, implements the search method described in the embodiments of the present disclosure.
Bus 930 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 900 typically includes a variety of electronic device readable media. Such media can be any available media that is accessible by electronic device 900 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 910 may also include computer-system readable media in the form of volatile memory, such as Random Access Memory (RAM) 940 and/or cache memory 950. The electronic device 900 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 960 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 9, commonly referred to as a "hard disk drive"). Although not shown in FIG. 9, a magnetic disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 930 via one or more data medium interfaces. Memory 910 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the various embodiments of the disclosure.
A program/utility 980 having a set (at least one) of program modules 970 may be stored, for example, in memory 910, such program modules 970 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 970 generally perform the functions and/or methods in the embodiments described in this disclosure.
The electronic device 900 may also communicate with one or more external devices 990 (e.g., keyboard, pointing device, display 991, etc.), one or more devices that enable a user to interact with the electronic device 900, and/or any devices (e.g., network card, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 992. Also, the electronic device 900 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter 993. As shown, the network adapter 993 communicates with other modules of the electronic device 900 over the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 900, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 920 performs various functional applications and data processing by running programs stored in the memory 910.
It should be noted that, the implementation process and the technical principle of the electronic device in this embodiment refer to the foregoing explanation of the search method in the embodiment of the disclosure, and are not repeated herein.
In an exemplary embodiment, a computer readable storage medium is also provided, for example a memory, comprising instructions executable by a processor of an electronic device to perform the search method set forth in any of the embodiments described above. Optionally, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program/instruction, characterized in that the computer program/instruction, when executed by a processor, implements the search method proposed by any of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (17)

1. A search method, characterized in that the method comprises:
in response to receiving a file search request sent by a client, for any candidate file among a plurality of candidate files associated with the client, obtaining a file name and a file title of the candidate file, and processing text information of the candidate file according to a file type of the candidate file to obtain a text slice set;
generating a first slice vector set according to the text slice set, the file name and the file title;
obtaining a preference vector of a target object logging into the client and a feature vector of a search keyword in a search statement in the file search request, and fusing the preference vector with the feature vector to obtain a session vector;
determining at least one target file from the candidate files according to a first similarity between the session vector and each slice vector in the first slice vector set of each candidate file, and sending each target file to the client;
wherein the determining at least one target file from the candidate files according to the first similarity between the session vector and each slice vector in the first slice vector set of each candidate file, and sending each target file to the client, comprises:
for any of the candidate files, performing feature dimension reduction on multiple slice vectors in the first slice vector of the candidate file to obtain a second slice vector set, and performing feature dimension reduction on the session vector according to the multiple slice vectors in the first slice vector of the candidate file to obtain a session vector after dimension reduction processing;
determining, according to a second similarity between each slice vector in the second slice vector set and the session vector after the dimension reduction processing, the first similarity corresponding to each slice vector in the first slice vector set of the candidate file;
determining at least one fourth slice vector from the multiple slice vectors in the second slice vector set of the candidate file according to the first similarity corresponding to each slice vector in the first slice vector set of the candidate file;
fusing the fourth slice vectors corresponding to the candidate file to obtain a file vector of the candidate file;
determining the target file from the candidate files according to a third similarity between the file vector of each candidate file and the session vector, and sending each target file to the client;
wherein the performing, for any of the candidate files, feature dimension reduction on multiple slice vectors in the first slice vector of the candidate file to obtain a second slice vector set comprises:
for any of the candidate files, clustering the multiple slice vectors in the first slice vector of the candidate file to obtain multiple vector clusters;
extracting, from each of the vector clusters, a set number of sample vectors according to the ratio of the number of slice vectors in each vector cluster to a set sampling number, so as to generate a sample vector set;
performing, according to the sample vector set, feature dimension reduction on the multiple slice vectors in the first slice vector of the candidate file to obtain the second slice vector set.

2. The method according to claim 1, characterized in that the performing, according to the sample vector set, feature dimension reduction on the multiple slice vectors in the first slice vector of the candidate file to obtain the second slice vector set comprises:
determining a sample mean vector of the sample vector set according to the number of sample vectors in the sample vector set;
updating each sample vector in the sample vector set according to the sample mean vector to obtain an updated sample vector set;
determining a covariance matrix of the updated sample vector set;
performing, according to the covariance matrix and the sample mean vector, feature dimension reduction on the multiple slice vectors in the first slice vector of the candidate file to obtain the second slice vector set.

3. The method according to claim 2, characterized in that the performing feature dimension reduction on the session vector according to the multiple slice vectors in the first slice vector of the candidate file to obtain a session vector after dimension reduction processing comprises:
obtaining a difference vector between the session vector and the sample mean vector of the candidate file;
obtaining the session vector after dimension reduction processing according to the product of the difference vector and the covariance matrix of the candidate file.

4. The method according to claim 1, characterized in that the determining the target file from the candidate files according to the third similarity between the file vector of each candidate file and the session vector, and sending each target file to the client, comprises:
obtaining a popularity value of the candidate file and a feature value of the social group to which the target object belongs, and weighting the third similarity according to the popularity value of the candidate file and the feature value of the social group to which the target object belongs, so as to obtain a fourth similarity;
sorting the candidate files according to the fourth similarity of each candidate file to obtain a candidate file sequence;
determining a target file sequence from the candidate file sequence, and sending the target file sequence to the client, wherein the target file sequence includes the at least one target file.

5. The method according to claim 4, characterized in that the determining a target file sequence from the candidate file sequence and sending the target file sequence to the client comprises:
sorting the fourth slice vectors of any target file in the target file sequence according to the second similarity corresponding to each fourth vector of that target file, to obtain a slice vector sequence;
obtaining, from the text slice set, a text slice sequence corresponding to the slice vector sequence;
splicing the text slices in the text slice sequence to obtain a spliced text;
constructing a prompt according to the spliced text and the search keyword, and extracting a summary of the target file according to the prompt to obtain summary information of the target file;
sending the target file sequence and the summary information corresponding to each target file in the target file sequence to the client, so as to display each target file in the target file sequence and the summary corresponding to each target file.

6. The method according to claim 5, characterized in that the method further comprises:
sending the target file sequence and the summary information corresponding to each file in the target file sequence to a blockchain, so as to store the target file sequence and the summary information corresponding to each file in the target file sequence.

7. The method according to claim 1, characterized in that the generating a first slice vector set according to the text slice set, the file name and the file title comprises:
vectorizing each text slice in the text slice set, the file name, and the file title respectively, to obtain an initial slice vector set, a file name vector, and a file title vector;
fusing the initial slice vector set, the file name vector and the file title vector to obtain the first slice vector set.

8. The method according to claim 1, characterized in that the obtaining a preference vector of a target object logging into the client and a keyword vector of a search keyword in the search statement comprises:
obtaining a behavior sequence, within a set period, of the target object logging into the client with respect to the candidate files;
determining the preference vector of the target object according to the behavior sequence;
extracting the search keyword from the search statement, and vectorizing the search keyword to obtain the feature vector of the search keyword.

9. The method according to claim 1, characterized in that the processing text information of the candidate file according to the file type of the candidate file to obtain a text slice set comprises:
extracting the text information of the candidate file according to the file type of the candidate file;
segmenting the text information of the candidate file according to length information of the text information of the candidate file and set slice length information, to obtain the text slice set of the candidate file, wherein the text slice set includes multiple text slices.

10. The method according to claim 9, characterized in that the extracting the text information of the candidate file according to the file type of the candidate file comprises:
in response to the file type of the candidate file being a picture, extracting multiple image features from the candidate file, and converting any image feature among the multiple image features of the candidate file into corresponding text description information;
determining the text information of the candidate file according to the text description information corresponding to each image feature of the candidate file.

11. The method according to claim 10, characterized in that the method further comprises:
in response to the file type of the candidate file being audio, extracting audio data from the candidate file;
converting the audio data in the candidate file into corresponding text description information;
determining the text information of the candidate file according to the text description information corresponding to the audio data in the candidate file.

12. The method according to claim 1, characterized in that, before the processing text information of the candidate file according to the file type of the candidate file to obtain a text slice set, the method further comprises:
determining, according to the search statement in the file search request, a target file album matching the search statement from a plurality of stored file albums;
taking the files in the target file album as the plurality of candidate files associated with the client.

13. The method according to claim 1, characterized in that, before the processing text information of the candidate file according to the file type of the candidate file to obtain a text slice set, the method further comprises:
extracting the file format of a first file to be searched from the search statement, and obtaining, from a plurality of stored files, a plurality of first files to be searched that match the file format;
taking the plurality of first files to be searched as the plurality of candidate files associated with the client.

14. The method according to claim 1, characterized in that, before the processing text information of the candidate file according to the file type of the candidate file to obtain a text slice set, the method further comprises:
obtaining a search permission matching the target object, and obtaining, from a plurality of stored files, a plurality of second files to be searched that match the search permission;
taking the plurality of second files to be searched as the plurality of candidate files associated with the client.

15. A search apparatus, characterized in that the apparatus comprises:
a first processing module, configured to, in response to receiving a file search request sent by a client, for any candidate file among a plurality of candidate files associated with the client, obtain a file name and a file title of the candidate file, and process text information of the candidate file according to a file type of the candidate file to obtain a text slice set;
a generating module, configured to generate a first slice vector set according to the text slice set, the file name and the file title;
a fusion module, configured to obtain a preference vector of a target object logging into the client and a feature vector of a search keyword in a search statement in the file search request, and fuse the preference vector with the feature vector to obtain a session vector;
a determining module, configured to determine at least one target file from the candidate files according to a first similarity between the session vector and each slice vector in the first slice vector set of each candidate file, and send each target file to the client;
wherein the determining module is specifically configured to:
for any of the candidate files, perform feature dimension reduction on multiple slice vectors in the first slice vector of the candidate file to obtain a second slice vector set, and perform feature dimension reduction on the session vector according to the multiple slice vectors in the first slice vector of the candidate file to obtain a session vector after dimension reduction processing;
determine, according to a second similarity between each slice vector in the second slice vector set and the session vector after the dimension reduction processing, the first similarity corresponding to each slice vector in the first slice vector set of the candidate file;
determine at least one fourth slice vector from the multiple slice vectors in the second slice vector set of the candidate file according to the first similarity corresponding to each slice vector in the first slice vector set of the candidate file;
fuse the fourth slice vectors corresponding to the candidate file to obtain a file vector of the candidate file;
determine the target file from the candidate files according to a third similarity between the file vector of each candidate file and the session vector, and send each target file to the client;
wherein the determining module is further specifically configured to:
for any of the candidate files, cluster the multiple slice vectors in the first slice vector of the candidate file to obtain multiple vector clusters;
extract, from each of the vector clusters, a set number of sample vectors according to the ratio of the number of slice vectors in each vector cluster to a set sampling number, so as to generate a sample vector set;
perform, according to the sample vector set, feature dimension reduction on the multiple slice vectors in the first slice vector of the candidate file to obtain the second slice vector set.

16. An electronic device, characterized by comprising:
a processor; and a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the search method according to any one of claims 1 to 14.

17. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause a computer to execute the search method according to any one of claims 1 to 14.
CN202311224005.8A 2023-09-20 2023-09-20 Search method, device, electronic device and storage medium Active CN117331893B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311224005.8A CN117331893B (en) 2023-09-20 2023-09-20 Search method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311224005.8A CN117331893B (en) 2023-09-20 2023-09-20 Search method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN117331893A CN117331893A (en) 2024-01-02
CN117331893B true CN117331893B (en) 2024-10-15

Family

ID=89291013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311224005.8A Active CN117331893B (en) 2023-09-20 2023-09-20 Search method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN117331893B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506864A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 File retrieval method and device, electronic equipment and readable storage medium
CN114356852A (en) * 2022-03-21 2022-04-15 展讯通信(天津)有限公司 File retrieval method, electronic equipment and storage medium
CN114996215A (en) * 2022-06-16 2022-09-02 中国联合网络通信集团有限公司 File searching method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491518B (en) * 2017-08-15 2020-08-04 北京百度网讯科技有限公司 Search recall method and device, server and storage medium
CN111177551B (en) * 2019-12-27 2021-04-16 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for determining search result
CN115391479A (en) * 2021-05-19 2022-11-25 中移动信息技术有限公司 Sorting method, device, electronic medium and storage medium for document search
CN116662633A (en) * 2022-12-23 2023-08-29 百度(中国)有限公司 Search method, model training method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN117331893A (en) 2024-01-02

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN110162593B (en) Search result processing and similarity model training method and device
CN112119388B (en) Train image embedding models and text embedding models
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
KR101721338B1 (en) Search engine and implementation method thereof
CN107346336B (en) Information processing method and device based on artificial intelligence
EP2438539B1 (en) Co-selected image classification
US9589208B2 (en) Retrieval of similar images to a query image
JP5749279B2 (en) Join embedding for item association
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
WO2023108980A1 (en) Information push method and device based on text adversarial sample
US20230237093A1 (en) Video recommender system by knowledge based multi-modal graph neural networks
CN111557000B (en) Accuracy Determination for Media
CN112805715B (en) Identifying entity-attribute relationships
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN113806588B (en) Method and device for searching video
CN117494815B (en) File-oriented credible large language model training and reasoning method and device
CN117056575B (en) Method for data acquisition based on intelligent book recommendation system
CN114168841A (en) Content recommendation method and device
CN113672804B (en) Recommendation information generation method, system, computer device and storage medium
CN110413770B (en) Method and device for classifying group messages into group topics
CN117331893B (en) Search method, device, electronic device and storage medium
CN113688281B (en) Video recommendation method and system based on deep learning behavior sequence
CN113312523B (en) Dictionary generation and search keyword recommendation method and device and server
CN114722267A (en) Information push method, device and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant