CN111309984A

CN111309984A - Method and device for searching node vector from database by using index

Info

Publication number: CN111309984A
Application number: CN202010163626.XA
Authority: CN
Inventors: 杨文�; 李涛; 方概; 魏宏
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-03-10
Filing date: 2020-03-10
Publication date: 2020-06-19
Anticipated expiration: 2040-03-10
Also published as: CN111309984B

Abstract

The embodiment of the specification provides a method and a device for searching node vectors from a database by using indexes. The PostgreSQL database includes vectors of a plurality of nodes, and the index divides the plurality of nodes into a plurality of clusters, one center point for each cluster. During retrieval, vector matching is carried out on center points corresponding to the clusters and a first node to be retrieved respectively based on the index, a target center point which is most matched with the first node on the vector is determined from the center points, vector matching is carried out on the nodes in the first cluster where the target center point is located and the first node respectively, and each node is added into a matching queue according to a matching result and is sequenced based on the nodes in the matching queue, so that the node retrieved aiming at the first node is determined.

Description

Method and device for searching node vector from database by using index

Technical Field

One or more embodiments of the present disclosure relate to the field of data retrieval technologies, and in particular, to a method and an apparatus for node vector retrieval from a database using an index.

Background

With the development of computer technology, data contains more and more information, and the requirement for data retrieval is gradually increased. Data containing more information can be generally represented by high-dimensional vectors, and for example, data such as images and user features can be represented by high-dimensional vectors. In some application scenarios, there is a requirement to retrieve high-dimensional vectors. For example, when the face is swiped for payment, the input face image can be retrieved from a large number of face images in the database; on the shopping site, the input commodity image can be retrieved from a large number of commodity images in the database. PostgreSQL is a source database supporting vector retrieval, and has high availability and high expansibility. With the increase of the number of data and the increase of vector dimension, the vector retrieval efficiency based on the database becomes a key direction for the improvement of the current technology.

Therefore, an improved scheme is desired that can improve the search efficiency when performing high-dimensional vector search based on the PostgreSQL database.

Disclosure of Invention

One or more embodiments of the present specification describe a method and an apparatus for node vector retrieval from a database using indexes, so as to improve the retrieval efficiency when performing high-dimensional vector retrieval based on a PostgreSQL database. The specific calculation scheme is as follows.

In a first aspect, embodiments provide a method for node vector retrieval from a PostgreSQL database using indices, executed by a computer; the database comprises vectors of a plurality of nodes, and the index divides the plurality of nodes into a plurality of clusters, wherein each cluster corresponds to a central point; the method comprises the following steps:

acquiring a first node to be retrieved;

respectively carrying out vector matching on the central points corresponding to the clusters and the first node based on the index, and determining a target central point which is most matched with the first node on the vector from the central points;

vector matching is carried out on a plurality of nodes in a first cluster where the target center point is located and the first nodes respectively, and each node is added into a matching queue according to a matching result;

determining the retrieved node for the first node based on the ordering of nodes in the matching queue.

In one embodiment, the index includes a plurality of central point data pages and a plurality of node data pages belonging to different clusters, the central point data pages are used for storing vectors of respective central points and starting node data page identifiers of a cluster where each central point is located, the node data pages are used for storing vectors of respective nodes, and a node in one node data page corresponds to one cluster.

In one embodiment, the step of vector matching the central points corresponding to the plurality of clusters with the first node respectively based on the index includes:

acquiring a plurality of central point data pages from the index, acquiring vectors of central points corresponding to the clusters from the central point data pages, and respectively matching the vectors of the central points corresponding to the clusters with the vector of the first node;

the step of performing vector matching on the plurality of nodes in the first cluster where the target central point is located and the first node respectively includes:

and acquiring a starting node data page identifier of the first cluster from a central point data page corresponding to the target central point, acquiring vectors of a plurality of nodes corresponding to the first cluster from a plurality of node data pages of the index based on the starting node data page identifier, and respectively matching the vectors of the plurality of nodes with the vector of the first node.

In one embodiment, multiple pages of node data belonging to the same cluster are contiguous.

In one embodiment, the plurality of node data pages belonging to the same cluster are not contiguous; and the node data page is also used for storing the node data page identifier before the node data page and the node data page identifier after the node data page in the same cluster.

In one embodiment, the target center points are one or more;

the step of respectively performing vector matching on the plurality of nodes in the first cluster where the target center point is located and the first node, and adding each node into a matching queue according to a matching result includes:

and aiming at the first cluster where each target central point is located, vector matching is carried out on a plurality of nodes in the first cluster and the first nodes respectively, and the nodes in the first cluster where each target central point is located are added into the same matching queue according to the matching result.

In one embodiment, the database further comprises a first field other than the vector field of each node; when the first node to be retrieved is obtained, the method further comprises the following steps:

obtaining a restricted field value condition for the first field;

the step of determining the node retrieved for the first node based on the ordering of nodes in the matching queue comprises:

aiming at a first number of nodes with highest matching degree in a matching queue, acquiring a first field value of the first number of nodes from the database;

and screening out nodes meeting the condition of the limited field value from the first number of nodes based on each obtained first field value to obtain nodes retrieved aiming at the first nodes.

In a second aspect, an embodiment provides an index creation method for node vector retrieval from a PostgreSQL database, executed by a computer, the database comprising vectors of a plurality of nodes, the method comprising:

obtaining a plurality of nodes from the database;

performing node clustering based on vectors of at least part of the nodes in the plurality of nodes to obtain a plurality of clusters and respective central points of the plurality of clusters;

determining a cluster to which the plurality of nodes respectively belong;

and recording the central points of the clusters and the division of the clusters into the nodes by using the indexes for node vector retrieval.

In one embodiment, the step of clustering nodes based on vectors of at least some of the plurality of nodes includes:

sampling a first portion of nodes from the plurality of nodes;

performing first clustering on the first part of nodes based on the vector of each node in the first part of nodes to obtain a plurality of clusters and respective central points of the plurality of clusters;

the step of determining the cluster to which the plurality of nodes respectively belong at least comprises: and determining clusters corresponding to all nodes in the first part of nodes.

In one embodiment, the step of recording the center points of the plurality of clusters and the partitions of the plurality of clusters to the plurality of nodes by using the index at least includes:

generating a central point data page, and storing vectors of central points of a plurality of clusters to the central point data page;

and generating a node data page corresponding to each cluster, respectively storing the vector of each node in the first part of nodes into the node data page corresponding to the cluster, and storing the initial node data page identifier of each cluster into the corresponding central point data page.

In one embodiment, the step of generating a node data page corresponding to each cluster includes:

and generating a node data page of each cluster at least based on the number of nodes corresponding to each cluster, and storing a node data page identifier after the node data page in each cluster to the node data page aiming at any one node data page in each cluster.

In one embodiment, the plurality of nodes includes a second node that is any node other than the first portion of nodes; the step of determining the cluster to which the plurality of nodes belong further includes:

respectively carrying out vector matching on the second nodes and each central point, and determining a second cluster corresponding to the second nodes according to matching results;

the step of recording the center points of the plurality of clusters and the division of the plurality of nodes by the plurality of clusters using the index further includes: and when the existing node data page of the second cluster is not full, adding the node vector of the second node into the existing node data page of the second cluster.

In one embodiment, the step of recording the center points of the plurality of clusters and the partitions of the plurality of clusters into the plurality of nodes by using the index further includes:

and when the existing node data page of the second cluster is full, adding the node data page of the second cluster, storing the vector of the second node and the initial node data page identification of the second cluster into the added node data page, updating the added node data page into the initial node data page of the second cluster, and updating the initial node data page identification into the central point data page of the second cluster.

In a third aspect, the embodiment provides an apparatus for node vector retrieval from a PostgreSQL database using indexes, deployed in a computer; the database comprises vectors of a plurality of nodes, and the index divides the plurality of nodes into a plurality of clusters, wherein each cluster corresponds to a central point; the device comprises:

the retrieval acquisition module is configured to acquire a first node to be retrieved;

a central point retrieval module configured to perform vector matching on the central points corresponding to the plurality of clusters and the first node respectively based on the index, and determine a target central point which is most vector-matched with the first node from the plurality of central points;

the node retrieval module is configured to perform vector matching on a plurality of nodes in a first cluster where the target central point is located and the first nodes respectively, and add each node into a matching queue according to a matching result;

a node determination module configured to determine a node retrieved for the first node based on the ordering of nodes in the match queue.

In one embodiment, the central point retrieving module, when vector-matching the central points corresponding to the plurality of clusters with the first node respectively based on the index, includes:

the node retrieval module, when vector matching is performed on the plurality of nodes in the first cluster where the target central point is located and the first node, includes:

In one embodiment, the target center points are one or more; the node retrieval module is specifically configured to perform vector matching on the plurality of nodes in the first cluster and the first node respectively for the first cluster where each target center point is located, and add the nodes in the first cluster where each target center point is located into the same matching queue according to a matching result.

In one embodiment, the database further comprises a first field other than the vector field of each node; the retrieval obtaining module is further configured to obtain a limiting field value condition for the first field when obtaining a first node to be retrieved;

the node determination module is specifically configured to, for a first number of nodes with the highest matching degree in a matching queue, obtain a first field value of the first number of nodes from the database; and screening out nodes meeting the condition of the limited field value from the first number of nodes based on each obtained first field value to obtain nodes retrieved aiming at the first nodes.

In a fourth aspect, an embodiment provides an index creation apparatus for node vector retrieval from a PostgreSQL database deployed in a computer, the database comprising a vector of a plurality of nodes, the apparatus comprising:

a node acquisition module configured to acquire a plurality of nodes from the database;

the node clustering module is configured to perform node clustering based on vectors of at least part of the nodes in the plurality of nodes to obtain a plurality of clusters and respective central points of the plurality of clusters;

a node attribution module configured to determine a cluster to which each of the plurality of nodes belongs;

and the index recording module is configured to record the central points of the clusters and the division of the clusters into the nodes by using the indexes for node vector retrieval.

In one embodiment, the node clustering module, when performing node clustering based on vectors of at least some of the plurality of nodes, includes:

sampling a first portion of nodes from the plurality of nodes;

the node attribution module is at least configured to determine a cluster corresponding to each node in the first part of nodes.

In one embodiment, the index recording module is at least configured to:

In an embodiment, the generating the node data page corresponding to each cluster by the index recording module includes generating the node data page of each cluster at least based on the number of nodes corresponding to each cluster, and storing, for any one node data page in each cluster, a node data page identifier after the node data page in the cluster to the node data page.

In one embodiment, the plurality of nodes includes a second node that is any node other than the first portion of nodes; the node attribution module is further configured to:

the index recording module is further configured to add the node vector of the second node to the existing node data page of the second cluster when the existing node data page of the second cluster is not full.

In one embodiment, the index recording module is further configured to:

In a fifth aspect, embodiments provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of the first to second aspects.

In a sixth aspect, an embodiment provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method of any one of the first aspect to the second aspect.

In the method and apparatus for performing node vector retrieval from a database by using an index provided in an embodiment of the present specification, the index divides a plurality of nodes in the database into a plurality of clusters, each cluster corresponds to a central point, when performing node retrieval, vector matching is performed on the central points of the plurality of clusters and a first node, a first cluster more matched with the first node is determined from the plurality of clusters according to a matching result, and a node group more similar to the first node can be determined in a large direction. And then matching the plurality of nodes of the first cluster with the first node respectively, and determining the nodes searched for the first node from the plurality of nodes corresponding to the first cluster according to the matching result, so that the searching range can be quickly narrowed, the matching times can be reduced, and the searching efficiency when high-dimensional vector searching is carried out based on a PostgreSQL database can be improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a schematic diagram of a database and index structure disclosed in the present specification;

FIG. 2 is a flowchart illustrating an index creation method according to an embodiment;

FIG. 3 is another schematic flow chart of a method for creating an index based on the method of FIG. 2;

FIG. 4 is a schematic diagram of an exemplary format of a central point data page and a node data page;

FIG. 5 is a flowchart illustrating a method for node vector search using indexes according to an embodiment;

FIG. 6 is a schematic block diagram of an apparatus for node vector retrieval from a PostgreSQL database using indexes according to an embodiment;

fig. 7 is a schematic block diagram of an index creation apparatus for node vector retrieval from a PostgreSQL database according to an embodiment.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

PostgreSQL is a source database supporting vector retrieval, and has high availability and high expansibility. The database may be used to store data. For example, the database may store images, user data, or behavioral event data, among others. Each piece of data in the database may be referred to as a node, and the data of the node may include a plurality of fields, for example, a vector field for representing a feature of the node, and for the user data, a region field, an age field, and the like; the face image can also comprise a name field and an age field in the face image. Wherein the vector of nodes may be composed of multidimensional data. For example, the vector corresponding to the image may be multidimensional data in which each pixel in the image is used as a vector; the vector corresponding to the user data may be obtained based on some calculation performed on the user information. The vector field is an important feature of the node, and is also an important search field. When the dimension of the vector itself is very high, the vector is also referred to as a high-dimensional vector.

The PostgreSQL database itself supports the retrieval of high-dimensional vectors. Vector-based retrieval is the process of finding one or more vectors from a database that best match the vector to be retrieved. For example, the database may be used to store various merchandise information sold by merchants, the merchandise information may include images of the merchandise and other information, and the images themselves may be stored in the form of high-dimensional vectors. When the client receives an article image input by the user, the client can search the database for the commodity information similar to the article image, and the article image can be matched with the image in each commodity information of the database during searching. When the image resolution is high or very high, a one-to-one match between a large number of high-dimensional vectors will be very time consuming.

In order to improve the retrieval efficiency when performing high-dimensional vector retrieval based on a PostgreSQL database, embodiments of the present specification provide a node vector retrieval method, which performs retrieval based on a pre-constructed index. The index may partition a plurality of nodes in the database into a plurality of clusters, one for each center point. During retrieval, vector matching is carried out on the central points of a plurality of clusters in the index and the nodes to be retrieved respectively, the clusters which are closer to the nodes to be retrieved are determined from the clusters, and then the nodes in the clusters are subjected to vector matching with the nodes to be retrieved respectively. The process described above comprises two phases, namely an index construction phase and a vector retrieval phase. The index build phase is first described below.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a PostgreSQL database and an index for node vector retrieval from the database in an embodiment of the present specification. The left database comprises a plurality of nodes, and the nodes are represented by black dots. The right side is an index of the database, the index divides a plurality of nodes in the database into a plurality of clusters, each cluster has a central point, and the central points are represented by hollow circles.

Fig. 2 is a flowchart illustrating an index creating method according to an embodiment. The method may be performed by a computer. In particular, the method may be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities. The creation method includes the following steps S210 to S240.

Step S210, a plurality of nodes are obtained from the database. The plurality of nodes acquired may be, but is not limited to, all nodes in the database. Obtaining a plurality of nodes may be understood as obtaining vectors of the plurality of nodes. For example, a vector of n nodes is obtained from the database, where n is a positive integer, and the obtained n nodes can be represented by X1, X2, X3, … …, and Xn.

Step S220, performing node clustering based on vectors of at least some nodes in the plurality of nodes to obtain respective central points of the plurality of clusters and the plurality of clusters. Step S230, determining a cluster to which each of the plurality of nodes belongs.

In clustering, clustering may be performed based on all nodes in the database, or may be performed based on some nodes in the database. When clustering is performed, various clustering algorithms can be used. For example, k-means clustering algorithms may be employed for clustering. The implementation of step S220 is described below with the k-means clustering algorithm as an example.

Specifically, the nodes S1, S2, S3, … … and Sm in the plurality of nodes S1, S2, S3 and … … and Sn can be clustered, wherein m is a positive integer and is less than or equal to n. During clustering, firstly, the number of clusters is set, namely the number k of the clusters is a positive integer, and k is less than m. Randomly selecting k nodes from the nodes S1, S2, S3, … … and Sm, respectively serving as the center points of k clusters, respectively calculating the vector matching degree between each node and the center points of the k clusters, determining the node as the cluster corresponding to the most matched center point, recalculating the center points of the k clusters based on the vectors of the nodes divided into the k clusters after determining the cluster corresponding to each node, returning to continuously execute the step of respectively calculating the vector matching degree between each node and the center points of the k clusters, and iteratively and continuously updating the center points of the k clusters. And when the vectors of the central points of the k clusters are not changed any more or the iteration times reach preset times, determining that the clustering is finished.

When the center points of the k clusters are recalculated based on the vectors of the nodes divided into the k clusters, the mean value of the vectors of the nodes divided into the clusters can be determined as the center point of each cluster for each cluster of the k clusters. Therefore, after the clustering is completed, the center point of the cluster may be one of the nodes divided into the cluster, or may be a point calculated based on the nodes divided into the cluster.

When calculating the vector matching degree between each node and the center point of k clusters, the matching degree between the vector of each node and the vector of the center point of k clusters may be calculated, and when determining the matching degree between the vectors, algorithms such as Pearson Correlation Coefficient, Euclidean Distance (Euclidean Distance), or cosine similarity may be used.

When the clustering in step S220 is completed, vectors of center points of the k clusters may be determined, and clusters to which respective nodes participating in the clustering belong may be determined. In step S230, the cluster to which each of the plurality of nodes belongs is determined, which may be at least understood as the cluster to which each of the nodes participating in the clustering belongs being obtained from the result of step S220. There is similarity in vectors between multiple nodes belonging to the same cluster.

Step S240, recording central points of the plurality of clusters and partitions of the plurality of clusters into a plurality of nodes by using the index for node vector retrieval. In this step, the center points of the k clusters and the division of the k clusters into n or m nodes may be recorded in various ways, and the recorded content is used as the index content. The contents of the index may be used to perform a node vector search.

As can be seen from the above, in the creating of the index in this embodiment, a plurality of nodes in the database are divided into a plurality of clusters through a clustering algorithm, and a central point corresponding to each cluster and a cluster to which each node belongs are determined. Therefore, the nodes in the database are divided into a plurality of clusters, when node retrieval is carried out, the central points of the clusters can be respectively subjected to vector matching with the nodes to be retrieved, the clusters which are more matched with the nodes to be retrieved are determined from the clusters according to matching results, the node groups which are more similar to the nodes to be retrieved can be determined in the large direction, and then the nodes of the clusters are respectively matched with the nodes to be retrieved, so that the retrieval range can be quickly reduced, the matching times are reduced, and the retrieval efficiency when high-dimensional vector retrieval is carried out based on the PostgreSQL database can be improved.

Referring back to the above steps S210 to S240, when the number of nodes included in the database is very large, some of the n nodes obtained from the database may be clustered. Referring to a flow diagram shown in fig. 3, m nodes sampled from the acquired n nodes are clustered to obtain a cluster 1, a cluster 2, a cluster k, and the like, and a corresponding central point 1, a central point 2, a central point k, and the like. And then, respectively carrying out vector matching on the remaining n-m nodes in the n nodes and each central point, and determining the cluster to which the n-m nodes belong. Each central point is stored in a central point data page, and nodes belonging to each cluster are respectively stored in a node data page corresponding to each cluster.

In this embodiment, in step S220, a first part of nodes may be sampled from the multiple nodes, and a first clustering is performed on the first part of nodes based on vectors of each node in the first part of nodes, so as to obtain respective central points of the multiple clusters and the multiple clusters. The first clustering may be performed using the k-means clustering algorithm described above.

For example, m nodes are sampled from n nodes to serve as a first partial node, and vectors of the m nodes are subjected to first clustering to obtain k clusters and vectors of respective center points of the k clusters. Where m < n, m can be much smaller than n even when the number of n is large. For example, where n is a billion, m can be taken to be ten million.

Step S220 performs clustering, including determining k clusters and center points of the k clusters. M nodes are sampled from the n nodes, and clustering is performed based on the m nodes, so that k clusters and central points of the k clusters can be completely determined, and the clustering calculation efficiency can be improved to a greater extent.

In step S230, when determining the cluster to which each of the plurality of nodes belongs, the cluster corresponding to each node in the first part of nodes may be determined first. For example, which cluster of the k clusters the m nodes respectively belong to may be determined based on the clustering result of step S220. Generally, a node can belong to only one cluster.

When the central points of the plurality of clusters and the clusters corresponding to the nodes in the first part of nodes are determined, the central points and the nodes can be recorded by using a data page provided by a PostgreSQL database. Step S240, when the index is used to record the central points of the multiple clusters and the division of the multiple clusters into multiple nodes, a central point data page may be specifically generated, and the vectors of the central points of the multiple clusters are stored in the central point data page; and generating a node data page corresponding to each cluster, respectively storing the vector of each node in the first part of nodes into the node data page of the corresponding cluster, and storing the initial node data page identification of each cluster into the corresponding central point data page.

When generating the node data page corresponding to each cluster, a plurality of node data pages corresponding to each cluster may be generated based on the number of nodes corresponding to each cluster, the data amount of each node vector, and the capacity of each node data page, and for any one node data page in each cluster, the node data page identifier following the node data page in the cluster is stored to the node data page. The node data page identifier that is prior to the node data page may also be stored in the node data page.

See, for example, the second row in the data page schematic shown in fig. 4. The cluster 1 corresponds to a node data page 3, a node data page 5, a node data page 6, and the like. In each node data page, a storage area between a header area and a footer area is used for storing the vector of the node. The page tail area of the node data page 3 stores the identifier p5 of the node data page 5, the page head area of the node data page 5 stores the identifier p3 of the node data page 9, the page tail area stores the identifier p6 of the node data page 6, and the page head area of the node data page 6 stores the identifier p5 of the node data page 5. Thus, each node data page in the cluster 1 stores the previous node data page identifier and the next node data page identifier, and the node data pages corresponding to the cluster 1 are connected into a chain. When a plurality of nodes in the cluster are acquired, other adjacent node data pages can be conveniently found based on the identification information stored in each node data page.

Wherein the capacity of each node data page may be the same. The node data page identification may be a number or other serial number of the node data page. The node data pages corresponding to each cluster may be continuous or discontinuous.

The center point data page may be identical in format and capacity to the node data page. In general, since the total number of center points is much smaller than the total number of nodes in the database, the total number of center point data pages is much smaller than the total number of node data pages in terms of the number of pages.

In generating the center point data page, the number of center point data pages required may be determined according to the data amount of the center point vector and the capacity of each center point data page, and the center point data page may be generated based on the number. For each central point data page, the central point data page identifier after the central point data page may be stored in the central point data page, or the central point data page identifier before the central point data page may be stored in the central point data page.

For example, for the first row in the data page diagram in fig. 4. The page tail region of the center point data page 0 stores the identifier p1 of the center point data page 1, the page head region of the center point data page 1 stores the identifier p0 of the center point data page 0, the page tail region stores the identifier p2 of the center point data page 2, and the page head region of the center point data page 2 stores the identifier p1 of the center point data page 1. Therefore, when a plurality of central points are obtained, other adjacent central point data pages can be conveniently found out based on the identification information stored in each central point data page.

In addition, the central point data page of fig. 4 also stores the start node data page identification of each cluster. For example, in the center point data page 0, the center point 1 and the corresponding start node data page identification p3 of the cluster 1, and the center point 2 and the corresponding start node data page identification p9 of the cluster 2 are stored. In this way, each center point can be associated with a corresponding cluster. In addition, each central point data page identification can also be stored in a designated area for rapidly acquiring the vector of each central point during retrieval.

Through the above manner, the center points corresponding to the plurality of clusters and the nodes corresponding to the plurality of clusters can be recorded in the index.

Returning to the schematic diagram shown in fig. 3, after dividing n nodes into m nodes and n-m nodes, and after completing clustering and completing the attribution records of the respective central points and m nodes, a cluster to which the n-m nodes belong may also be determined.

For any second node X2 in the n-m nodes, vector matching can be further performed on the second node X2 and each central point, and the second cluster C2 corresponding to the second node X2 is determined according to the matching result. When the existing node data page of the second cluster C2 is not full, the node vector of the second node X2 is added to the existing node data page of the second cluster C2. The existing node data page can be understood as an already generated and existing node data page.

When the existing node data page of the second cluster C2 is full, the node data page of the second cluster C2 is added, the vector of the second node X2 and the start node data page identifier of the second cluster C2 are stored in the added node data page, the added node data page is updated to the start node data page of the second cluster C2, and the start node data page identifier is updated to the center point data page of the second cluster C2. That is, the newly added node data page is inserted into the head of the existing node data page as a new starting node data page.

After storing each center point to the center point data page and each node to the node data page of the corresponding cluster, the center point data page and the node data page may be saved in the disk.

After the index is constructed through the above embodiments, the node vector retrieval process may be performed based on the index. An embodiment of the vector retrieval phase is described below.

Fig. 5 is a flowchart illustrating a method for retrieving a node vector from a PostgreSQL database by using an index according to an embodiment. The method may be performed by a computer, and in particular, the method may be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities. Wherein the index is created using the method shown in fig. 2. The index divides the plurality of nodes into a plurality of clusters, each cluster corresponding to a center point. The search method includes the following steps S510 to S540.

In step S510, a first node X1 to be retrieved is obtained. The first node X1 to be retrieved can be understood as the node that needs to be retrieved from the database that is similar to it in a vector. The acquired first node X1 is an arbitrary node.

Step S520, based on the index, vector-matching the center points corresponding to the plurality of clusters with the first node X1, respectively, and determining a target center point Xm that is most vector-matched with the first node X1 from the plurality of center points. The target center point Xm may be one or more. When the target center point Xm is determined, each center point may also be added to a center point matching queue according to a matching result, and one or more most matched target center points Xm are determined from the center point matching queue.

Step S530, vector matching is carried out on a plurality of nodes in the first cluster C1 where the target center point Xm is located and the first node X1, and each node is added into a matching queue according to a matching result.

When the number of the target center points Xm is multiple, for the first cluster C1 where each target center point Xm is located, vector matching is performed on multiple nodes in the first cluster C1 and the first node X1, and the nodes in the first cluster C1 where each target center point Xm is located are added into the same matching queue according to the matching result. That is, according to the matching result, the nodes in different clusters are added into the same matching queue, which is more beneficial to comparing the matching degree between the nodes in different clusters and the first node X1.

In step S540, the node retrieved for the first node X1 is determined based on the node ordering in the matching queue. In this step, the previous preset number of nodes may be determined as nodes retrieved for the first node X1 from among the nodes in the matching queue arranged from large to small according to the matching degree, as a retrieval result.

As can be seen from the above, in the embodiment, when node retrieval is performed, vector matching is performed on the center points of the multiple clusters and the first node, and the first cluster more matched with the first node is determined from the multiple clusters according to the matching result, so that a node group more similar to the first node can be determined in a large direction. And then matching the plurality of nodes of the first cluster with the first node respectively, and determining the nodes searched for the first node from the plurality of nodes corresponding to the first cluster according to the matching result, so that the searching range can be quickly narrowed, the matching times can be reduced, and the searching efficiency when high-dimensional vector searching is carried out based on a PostgreSQL database can be improved.

In an embodiment, for convenience of implementation and to improve efficiency in retrieval, the index may use the center point data page and the node data page to record the center point and the node, respectively. Specifically, the index may include a plurality of center point data pages and a plurality of node data pages belonging to different clusters. And the central point data page is used for storing the vectors of the central points and the initial node data page identification of the cluster where each central point is located. And the node data page is used for storing the vector of each node. A node in a node data page corresponds to a cluster. The format of the various data pages can be seen in the schematic diagram shown in fig. 4.

The data pages of the plurality of nodes belonging to the same cluster may be continuous or discontinuous. When the plurality of node data pages belonging to the same cluster are not continuous, the node data page is also used for storing a node data page identifier before the node data page and a node data page identifier after the node data page in the same cluster.

In step S520, when vector matching is performed on the center points corresponding to the plurality of clusters with the first node X1 based on the index, specifically, a plurality of center point data pages may be obtained from the index, vectors of the center points corresponding to the plurality of clusters may be obtained from the plurality of center point data pages, and the vectors of the center points corresponding to the plurality of clusters may be matched with the vector of the first node X1.

In step S530, when vector matching is performed between the plurality of nodes in the first cluster C1 where the target center point Xm is located and the first node C1, specifically, a start node data page identifier of the first cluster C1 may be obtained from the center point data page corresponding to the target center point, based on the start node data page identifier, vectors of the plurality of nodes corresponding to the first cluster C1 are obtained from the plurality of indexed node data pages, and the vectors of the plurality of nodes are respectively matched with the vector of the first node.

In the actual retrieval, the retrieval condition may also include retrieval for fields other than the vector field. The embodiment of the specification also provides a retrieval method for carrying out retrieval by combining the vector field and other fields.

The database may further include a first field other than the vector field of each node, for example, the first field may be a non-vector field such as a region field, an age field, or a year field, and these fields may also be text fields. When the first node X1 to be retrieved is acquired at step S510, a restriction field value condition for the first field X1 may also be acquired. The restricted field condition may include that the value of the first field is within a certain range or that the value of the first field is equal to a preset certain value. For example, where the first field is a city, the field value may be limited to Beijing; where the first field is year, the field value may be limited to 2015-2020.

In step S540, when the retrieved node for the first node X1 is determined based on the node ranking in the matching queue, for a first number of nodes with the highest matching degree in the matching queue, first field values of the first number of nodes may be obtained from the database, and based on each obtained first field value, a node satisfying a condition of a restricted field value is screened from the first number of nodes, so as to obtain a node retrieved for the first node X1. It can be seen that the embodiment can not only search for the vector field alone, but also search for the combination of the vector field and other text fields, thereby enriching the search function.

The foregoing describes certain embodiments of the present specification, and other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily have to be in the particular order shown or in sequential order to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Fig. 6 is a schematic block diagram of an apparatus for node vector retrieval from a PostgreSQL database using indexes according to an embodiment. The apparatus 600 is deployed in a computer, and the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 5. The database comprises vectors of a plurality of nodes, and the index divides the plurality of nodes into a plurality of clusters, wherein each cluster corresponds to a central point. The apparatus 600 comprises:

a retrieval obtaining module 610 configured to obtain a first node to be retrieved;

a central point retrieving module 620, configured to perform vector matching on the central points corresponding to the plurality of clusters and the first node respectively based on the index, and determine a target central point which is most matched with the first node on the vector from the plurality of central points;

the node retrieval module 630 is configured to perform vector matching on a plurality of nodes in a first cluster where the target center point is located and the first node, and add each node into a matching queue according to a matching result;

a node determination module 640 configured to determine the retrieved node for the first node based on the ordering of the nodes in the match queue.

In a specific embodiment, the index may include a plurality of central point data pages and a plurality of node data pages belonging to different clusters, the central point data pages are used for storing vectors of respective central points and start node data page identifiers of a cluster where each central point is located, the node data pages are used for storing vectors of respective nodes, and a node in one node data page corresponds to one cluster.

In a specific embodiment, the center point retrieving module 620, when vector-matching center points corresponding to a plurality of clusters with the first node respectively based on the index, includes:

acquiring a plurality of central point data pages from the index, acquiring vectors of central points corresponding to a plurality of clusters from the plurality of central point data pages, and respectively matching the vectors of the central points corresponding to the plurality of clusters with the vector of the first node;

the node retrieving module 630, when vector matching is performed between the plurality of nodes in the first cluster where the target center point is located and the first node, includes:

and acquiring an initial node data page identifier of the first cluster from a central point data page corresponding to the target central point, acquiring vectors of a plurality of nodes corresponding to the first cluster from a plurality of indexed node data pages based on the initial node data page identifier, and respectively matching the vectors of the plurality of nodes with the vectors of the first node.

In one embodiment, the target center points are one or more; the node retrieving module 630 is specifically configured to:

and aiming at the first cluster where each target center point is located, vector matching is carried out on a plurality of nodes in the first cluster and the first nodes respectively, and the nodes in the first cluster where each target center point is located are added into the same matching queue according to the matching result.

In one embodiment, the database further comprises a first field other than the vector field of each node; the retrieval obtaining module 610 is further configured to, when obtaining a first node to be retrieved, obtain a restricted field value condition for a first field;

the node determining module 640 is specifically configured to:

aiming at a first number of nodes with the highest matching degree in a matching queue, acquiring a first field value of the first number of nodes from a database;

and screening out nodes meeting the condition of limiting field values from the first number of nodes based on the acquired first field values to obtain the nodes retrieved aiming at the first nodes.

Fig. 7 is a schematic block diagram of an index creation apparatus for node vector retrieval from a PostgreSQL database according to an embodiment. The apparatus 700 is deployed in a computer, and the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2. Wherein the database comprises vectors for a plurality of nodes. The apparatus 700 comprises:

a node obtaining module 710 configured to obtain a plurality of nodes from a database;

a node clustering module 720 configured to perform node clustering based on vectors of at least some nodes in the plurality of nodes to obtain respective center points of the plurality of clusters;

a node attribution module 730 configured to determine a cluster to which each of the plurality of nodes belongs;

the index recording module 740 is configured to record the center points of the plurality of clusters and the division of the plurality of nodes by the plurality of clusters by the index for node vector retrieval.

In one embodiment, the node clustering module 720, when performing node clustering based on vectors of at least some of the plurality of nodes, includes:

sampling a first part of nodes from a plurality of nodes;

the node attribution module 730 is configured to determine, at least, clusters corresponding to the nodes in the first part of nodes.

In one embodiment, the index recording module 740 is configured to at least:

and generating a node data page corresponding to each cluster, respectively storing the vector of each node in the first part of nodes into the node data page of the corresponding cluster, and storing the initial node data page identification of each cluster into the corresponding central point data page.

In a specific embodiment, the index recording module 740, when generating the node data page corresponding to each cluster, includes:

In one embodiment, the plurality of nodes includes a second node that is any node other than the first node; a node attribution module 730, further configured to:

the index recording module 740 is further configured to add the node vector of the second node to the existing node data page of the second cluster when the existing node data page of the second cluster is not full.

In a specific embodiment, the index recording module 740 is further configured to:

The above device embodiments correspond to the method embodiments, and for specific description, reference may be made to the description of the method embodiments, which is not described herein again. The device embodiment is obtained based on the corresponding method embodiment, has the same technical effect as the corresponding method embodiment, and for the specific description, reference may be made to the corresponding method embodiment.

Embodiments of the present specification also provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of fig. 1 to 5.

The present specification also provides a computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to implement the method of any one of fig. 1 to 5.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the storage medium and the computing device embodiments, since they are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to some descriptions of the method embodiments for relevant points.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above-mentioned embodiments further describe the objects, technical solutions and advantages of the embodiments of the present invention in detail. It should be understood that the above description is only exemplary of the embodiments of the present invention, and is not intended to limit the scope of the present invention, and any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims

1. A method for node vector retrieval from a PostgreSQL database using indices, performed by a computer; the database comprises vectors of a plurality of nodes, and the index divides the plurality of nodes into a plurality of clusters, wherein each cluster corresponds to a central point; the method comprises the following steps:

acquiring a first node to be retrieved;

2. The method of claim 1, wherein the index comprises a plurality of central point data pages and a plurality of node data pages belonging to different clusters, the central point data pages are used for storing vectors of respective central points and starting node data page identifications of clusters where each central point is located, the node data pages are used for storing vectors of respective nodes, and a node in one node data page corresponds to one cluster.

3. The method of claim 2, wherein the step of vector matching the center points corresponding to the plurality of clusters with the first node based on the index comprises:

4. The method of claim 2, wherein a plurality of node data pages belonging to the same cluster are contiguous.

5. The method of claim 2, wherein the plurality of node data pages belonging to the same cluster are not contiguous; and the node data page is also used for storing the node data page identifier before the node data page and the node data page identifier after the node data page in the same cluster.

6. The method of claim 1, the target center points being one or more;

7. The method of claim 1, the database further comprising a first field outside of the vector field of each node; when the first node to be retrieved is obtained, the method further comprises the following steps:

obtaining a restricted field value condition for the first field;

8. An index creation method for node vector retrieval from a PostgreSQL database, executed by a computer, the database comprising a vector of a plurality of nodes, the method comprising:

obtaining a plurality of nodes from the database;

determining a cluster to which the plurality of nodes respectively belong;

9. The method of claim 8, the step of clustering nodes based on vectors of at least some of the plurality of nodes, comprising:

sampling a first portion of nodes from the plurality of nodes;

10. The method of claim 9, wherein the step of recording the center points of the plurality of clusters and the partitions of the plurality of clusters into the plurality of nodes using the index comprises at least:

11. The method of claim 10, the step of generating a node data page corresponding to each cluster comprising:

12. The method of claim 10, wherein the plurality of nodes includes a second node that is any node other than the first portion of nodes; the step of determining the cluster to which the plurality of nodes belong further includes:

13. The method of claim 12, the step of recording center points of the plurality of clusters and partitions of the plurality of clusters into the plurality of nodes using the index, further comprising:

14. An apparatus for node vector retrieval from a PostgreSQL database using indices, deployed in a computer; the database comprises vectors of a plurality of nodes, and the index divides the plurality of nodes into a plurality of clusters, wherein each cluster corresponds to a central point; the device comprises:

15. The apparatus of claim 14, the index comprising a plurality of center point data pages and a plurality of node data pages belonging to different clusters, the center point data pages storing vectors of respective center points and start node data page identifications of clusters in which each center point is located, the node data pages storing vectors of respective nodes, a node in a node data page corresponding to a cluster.

16. The apparatus of claim 15, wherein the center point retrieving module, when vector-matching the center points corresponding to the plurality of clusters with the first node respectively based on the index, comprises:

17. The apparatus of claim 14, the target center points are one or more; the node retrieval module is specifically configured to:

18. The apparatus of claim 14, the database further comprising a first field outside of the vector field of each node; the retrieval obtaining module is further configured to obtain a limiting field value condition for the first field when obtaining a first node to be retrieved;

the node determination module is specifically configured to:

19. An index creation apparatus for node vector retrieval from a PostgreSQL database deployed in a computer, the database comprising a vector of a plurality of nodes, the apparatus comprising:

20. The apparatus of claim 19, the node clustering module, when performing node clustering based on vectors of at least some of the plurality of nodes, comprises:

sampling a first portion of nodes from the plurality of nodes;

21. The apparatus of claim 20, the index recording module configured to at least:

22. The apparatus of claim 21, wherein the index recording module, when generating the node data page corresponding to each cluster, comprises:

23. The apparatus of claim 21, the plurality of nodes comprising a second node that is any node other than the first portion of nodes; the node attribution module is further configured to:

24. The apparatus of claim 23, the index recording module further configured to:

25. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-13.

26. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-13.