Disclosure of Invention
The invention aims to provide a graph data compression method for a graph computation accelerator and the graph computation accelerator, which can improve the parallelism and the energy efficiency of the graph computation accelerator.
In order to achieve this purpose, the invention adopts the following main technical scheme:
the invention provides a graph data compression method for a graph computation accelerator, which comprises the following steps:
S1, a preprocessing circuit of the graph computation accelerator converts graph data to be processed, represented by a sparse adjacency matrix, into graph data in the compressed sparse column independent (CSCI) format; each column of the independently compressed graph data comprises a column identifier data pair and the data pairs of its non-zero elements; each data pair comprises an index and a value, and the highest two bits of the index indicate the meaning of the remaining bits of the index and of the value;
S2, the preprocessing circuit of the graph computation accelerator stores the converted graph data in the CSCI format in a memory of the graph computation accelerator.
As a further improvement of the present invention, the step S1 includes:
independently compressing the graph data represented by the sparse adjacency matrix into individual data pairs according to columns;
each data pair structure includes: index and value;
the data pair with the highest two index bits of "01" or "10" is a column identifier ioc;
the data pairs following each column identifier data pair are the data pairs corresponding to the non-zero elements of all rows of that column.
As a further improvement of the present invention, when the highest two bits of the index are "01", the remaining bits of the index represent a column index, and value represents the number of non-zero elements of that column in the sparse adjacency matrix;
when the highest two bits of the index are "10", the remaining bits of the index represent a column index, that column being the last column of the sparse adjacency matrix, and value represents the number of non-zero elements of that column in the sparse adjacency matrix;
when the highest two bits of the index are "00", the remaining bits of the index represent a row index, and value represents the corresponding non-zero element value in the sparse adjacency matrix.
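For illustration only, the data-pair encoding described above can be sketched in software as follows (a minimal Python sketch assuming the 32-bit index / 16-bit value layout of the embodiment below; the function and constant names are illustrative, not part of the invention):

    # Tags carried in the highest two bits of the 32-bit index.
    TAG_NONZERO = 0b00  # remaining index bits: row index;    value: non-zero element
    TAG_COLUMN  = 0b01  # remaining index bits: column index; value: non-zero count
    TAG_LAST    = 0b10  # as TAG_COLUMN, but marks the last column of the matrix

    def encode_pair(tag, idx, value):
        """Pack a CSCI data pair; the tag occupies index[31:30]."""
        assert idx < (1 << 30) and value < (1 << 16)
        return ((tag << 30) | idx, value)

    def decode_pair(index, value):
        """Unpack a CSCI data pair into (tag, remaining index bits, value)."""
        return (index >> 30, index & ((1 << 30) - 1), value)

For example, encode_pair(TAG_COLUMN, 1, 2) yields the column identifier of a first column holding two non-zero elements.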
As a further improvement of the present invention, the numbers of bits of the index and the value are determined according to the scale of the sparse adjacency matrix data.
In another aspect, the present invention also provides a graph computation accelerator, comprising a preprocessing circuit and a memory;
the preprocessing circuit performs the conversion process on the sparse adjacency matrix data according to the compression method of any one of the foregoing aspects.
As a further improvement of the invention, the graph computation accelerator further comprises: a control circuit, a data access unit, a scheduler, a mixed granularity processing unit and a result generation unit;
the preprocessing circuit is further used for storing a copy of the column identifiers in the CSCI into the memory;
the control circuit is used for receiving a conversion ready indication signal sent by the preprocessing circuit after the preprocessing circuit finishes storing into the memory, receiving the graph computation application type sent by the host, controlling the operation of the data access unit, the mixed granularity processing unit and the result generation unit, and sending the root vertex index of application type one or the source vertex index of application type two, as sent by the host, to the data access unit;
the data access unit is used for reading the graph data and the column identifiers of the CSCI from the memory, calculating the physical address in the memory of the designated vertex according to the root vertex index, the source vertex index, or the active vertex index transmitted by the result generation unit so as to access the data, and transmitting the read data to the scheduler;
the scheduler is used for temporarily storing the non-zero element data whose quantity is indicated by the column identifiers in the CSCI, and distributing the temporarily stored data to the processing elements in the mixed granularity processing unit for processing according to the state signals of the processing elements in the mixed granularity processing unit;
the mixed granularity processing unit is used for processing the data temporarily stored in the scheduler in parallel, according to the application type from the control circuit and the active vertex data from the result generation unit, and transmitting the processed intermediate data to the result generation unit;
and the result generation unit is used for processing the intermediate data according to the application type from the control circuit, sending the active vertex indexes produced during processing to the data access unit, and storing the final result.
As a further improvement of the present invention, the control circuit includes: a host interface component and a control logic component;
the host interface component is used for receiving the application type, the root vertex index of application type one, and the source vertex index of application type two sent by the host;
the control logic component is used for receiving the conversion ready indication signal sent by the preprocessing circuit, sending the root vertex index or the source vertex index to the data access unit, sending the application type to the mixed granularity processing unit and the result generation unit, and starting each module in the graph computation accelerator;
application type one is the breadth-first search (BFS) application type, and application type two is the single-source shortest path (SSSP) application type.
As a further improvement of the present invention, the data access unit includes: a user logic component, an address calculation module and a column identification register;
the column identification register is used for temporarily storing the column identifiers of the graph data in the CSCI;
the address calculation module is used for calculating the physical address in the memory of the data corresponding to the current active vertex i, according to the vertex index input by the control circuit or by the result generation unit, in combination with the per-column non-zero element counts in the column identification register and the amount of data stored per memory row;
the user logic component is used for reading the column identifiers from the memory and temporarily storing them in the column identification register, reading the data corresponding to the corresponding active vertex from the memory according to the address calculated by the address calculation module, and sending the read data to the scheduler;
after receiving the pause-read signal sent by the scheduler, the user logic component stops reading data from the memory;
the user logic component is also used for resuming reading after the pause-read signal sent by the scheduler becomes invalid.
As a further improvement of the present invention, the scheduler comprises: a buffer allocation module, a task scheduling module and a double-buffer module;
the buffer allocation module is used for parsing the column identifier data pair of each column of data transmitted from the data access unit, transmitting the graph data corresponding to the column identifier to the double-buffer module according to the buffer state information transmitted by the double-buffer module, and transmitting a stop-read signal to the data access unit when all the buffers in the double-buffer module are occupied;
the task scheduling module is used for sending unscheduled data in all the buffers into a processing element which is idle and has the calculation capacity meeting the requirement for processing according to the processing element state signal transmitted by the mixed granularity processing unit and the buffer state information transmitted by the double-buffer module;
the double-buffer module comprises a plurality of groups of front and back double buffers with different temporary storage capacities;
the double-buffer module is used for notifying the task scheduling module that the graph data cached in a buffer can be scheduled when that buffer's state becomes full, setting the buffer's state to empty after its data has been scheduled, and sending the buffer states to the buffer allocation module.
As a further improvement of the invention, the mixed granularity processing unit comprises: an auxiliary circuit module and a processing element array;
the auxiliary circuit module is used for transmitting the active vertex data pair input by the result generation unit and the corresponding CSCI graph data input by the scheduler to a corresponding idle processing element in the processing element array, according to the state of each processing element in the processing element array;
the processing element array consists of a plurality of processing elements PE with different capacities, and the plurality of processing elements work in parallel;
and after receiving the active vertex data pair and the CSCI graph data input by the auxiliary circuit module, each processing element performs the calculation on them according to the application type transmitted by the control circuit.
As a further improvement of the present invention, each processing element is specifically configured to:
when the application type transmitted by the control circuit is breadth-first search, updating the value of each data pair in the CSCI graph data with the value of the active vertex data pair plus 1;
when the application type is single-source shortest path, adding the value of the active vertex data pair to the value of each data pair in the CSCI graph data, and updating the value of each data pair with the sum;
the calculation results of the processing elements are output to the result generation unit, and the maximum number of data pairs contained in a calculation result equals the number of data pairs each processing element can process simultaneously.
As a further improvement of the invention, the result generation unit comprises: an operation module, a comparator and an on-chip result register;
the operation module comprises: an 8-way 4-stage pipeline tree consisting of 15 operation units;
the calculation capacity of each operation unit is the same as the maximum number of data pairs contained in the input data;
each operation unit is used for calculating the intermediate data input by the mixed granularity processing unit according to the application type input by the control circuit;
the comparator is used for processing the data input by the operation module pair by pair: for the row index of each data pair, it reads the previous value corresponding to that row index from the on-chip result register and compares it with the input current value; if the current value is not smaller than the previous value, no operation is performed and the next row index is processed directly; if the current value is smaller, the value for that row index in the on-chip result register is updated, the vertex corresponding to the row index is set as an active vertex, the row index is output to the data access unit, and the (row index, value) data pair is output to the mixed granularity processing unit;
the on-chip result register is used for temporarily storing the depth/distance of each vertex.
As a further improvement of the present invention, each of the operation units is specifically configured such that the calculations for application type one and application type two are both comparison operations: the values of the two input data pairs with the same row index are compared, and the smaller value is taken as the new value for that row index and output to the next stage, until the last-stage op_cell outputs the new value to the comparator.
As a further improvement of the present invention, the address calculation module is further configured to determine the physical address PhyAddr from the base address BaseAddr of the CSCI graph data in the memory according to formula one;
formula one: PhyAddr = BaseAddr + (nnz_c_0 + nnz_c_1 + ... + nnz_c_i)/RowSize;
where nnz_c_i represents the number of non-zero elements of the corresponding column in the sparse adjacency matrix, i is the column index, and RowSize represents the number of bytes of data stored per memory row.
The invention has the beneficial effects that:
the graph data compressed by the graph data compression method is applied to the graph computation accelerator, so that the graph computation accelerator efficiently realizes the BFS application, the SSSP application and the like in the graph computation, the effective bandwidth and the parallelism are improved, and the processing process is accelerated.
Detailed Description
For the purpose of better explaining the present invention and facilitating understanding, the present invention will be described in detail below by way of specific embodiments with reference to the accompanying drawings.
At present, various data models can be adopted for managing large-scale graph data; according to the number of vertices that one edge can connect, these models are divided into simple graph models and hypergraph models. The invention is directed to the simple graph model, i.e., one edge can only connect two vertices and loops are allowed. Real-world graphs usually have an average degree, i.e., the ratio of the number of edges to the number of vertices, of only a few to a few hundred; compared with vertex scales of tens of millions or even hundreds of millions, such graphs are very sparse, and vertex degrees follow a power-law distribution.
The simple graph model may be represented in the form of a sparse adjacency matrix. Because of its large scale, the graph data is stored in compressed form in the memory; compressed forms include CSC (Compressed Sparse Column), CSR (Compressed Sparse Row), COO (Coordinate List), DCSC (Doubly Compressed Sparse Column), the CSCI (Compressed Sparse Column Independent) format introduced in the present invention, and the like.
Breadth-first search (BFS) is a basic graph search algorithm and the basis of many important graph algorithms. Starting from a given vertex, called the root, BFS iteratively searches all reachable vertices in a given graph and calculates the depth, i.e., the minimum number of edges, from the root vertex to each reachable vertex. At initialization, the depth of the root vertex is set to 0 and the root is marked as active, while the depths of all other vertices are set to infinity. At the t-th iteration, the depth of a vertex v adjacent to an active vertex is calculated as follows. If the depth of a vertex is updated from infinity to t+1, the vertex is marked as active and used for the next iteration. This is repeated until the search is complete.
depth(v)=min(depth(v),t+1)
The single-source shortest path (SSSP) is used to compute the shortest-path distances from a specified source vertex to all reachable vertices in a given graph. At initialization, the distance of the source vertex is set to 0 and the source is marked as active, while the distances of all other vertices are set to infinity. At the t-th iteration, assuming the weight of the edge from vertex u to vertex v is w(u, v), the shortest-path distance from the source vertex to vertex v is calculated as follows. If the distance of a vertex is updated, the vertex is marked as active and used for the next iteration. This is repeated until all reachable vertices have been processed.
distance(v)=min(distance(v),distance(u)+w(u,v))。
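For illustration only, the two update rules can be sketched in software as follows (a minimal Python sketch assuming the graph is given as an adjacency dictionary {u: {v: w(u, v)}}; it illustrates the algorithms, not the accelerator's circuit flow; BFS is the special case in which every edge weight is 1, so distance reduces to depth):

    import math

    def sssp(adj, source):
        """Iterate distance(v) = min(distance(v), distance(u) + w(u, v))."""
        dist = {v: math.inf for v in adj}
        dist[source] = 0                       # the source starts at distance 0
        active = {source}                      # and is the first active vertex
        while active:
            next_active = set()
            for u in active:
                for v, w in adj[u].items():
                    if dist[u] + w < dist[v]:  # the min(...) update rule above
                        dist[v] = dist[u] + w
                        next_active.add(v)     # updated vertices drive the next round
            active = next_active
        return dist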
The graph computation accelerator is a circuit structure for graph computation that adopts independent sparse column compression and a mixed granularity processing unit for parallel acceleration, and can be used for graph computation applications such as BFS and SSSP. The structure and operation of the graph computation accelerator of the present invention will be described in detail below with reference to figs. 1A to 7.
As shown in fig. 1A, the graph computation accelerator structure of the embodiment of the present invention includes a preprocessing circuit (CSCIU, Compressed Sparse Column Independent Unit), a control circuit (CTR, ConTRoller), a data access unit (DAU, Data Access Unit), a scheduler (SCD, SCheDuler), a mixed granularity processing unit (MGP, Mixed-Granularity Processing unit) and a result generation unit (RGU, Result Generation Unit).
Wherein the memory in the graph computation accelerator architecture may be understood as a general purpose memory for storing graph computation related data.
The preprocessing circuit (CSCIU) converts the input sparse adjacency matrix graph data into the independent sparse column compression format (CSCI) and stores it into the memory; meanwhile, a copy of the column identifiers in the CSCI is stored in the memory, i.e., a copy of the number of non-zero elements corresponding to each column in the column identifiers (ioc) is stored in the memory.
The input to the preprocessing circuit may be the original input from an external structure. In addition, since there are various simple-graph representations, the preprocessing circuit converts graph data given in adjacency matrix form.
In this embodiment, the index of a data pair (index, value) in the CSCI format after graph data compression can be represented by 32 bits and the value by 16 bits; the specific meanings of index and value are shown in table 1, where a data pair whose index[31:30] is "01" or "10" is a column identifier. The copies of the non-zero element counts corresponding to each column in the column identifiers are stored in the memory column by column.
TABLE 1 Description of data pairs in the CSCI format

index[31:30]   index[29:0]    value[15:0]
"01"           column index   number of non-zero elements of the column (not the last column)
"10"           column index   number of non-zero elements of the column (the last column)
"00"           row index      value of the corresponding non-zero element
After the storage is finished, the preprocessing circuit sends a conversion ready indication signal to the control circuit.
In the present embodiment, the independent sparse column compression CSCI format represents the sparse adjacency matrix of the graph data as data pairs (index, value) compressed independently column by column.
For convenience of explanation, it is assumed that index is represented by 32 bits and value by 16 bits; a specific application can determine the numbers of bits representing index and value according to the actual scale of the graph data. The specific meanings of index and value are shown in table 1, where a data pair with index[31:30] of "01" or "10" is the column identifier ioc.
A data pair whose highest two index bits are "01" or "10" is a column identifier ioc (indicator of column), and the data pairs after each column identifier data pair are the data pairs corresponding to the non-zero elements of all rows of that column;
when the highest two bits of the index are "01", the remaining bits of the index represent a column index, and value represents the number of non-zero elements of that column in the sparse adjacency matrix;
when the highest two bits of the index are "10", the remaining bits of the index represent a column index, that column being the last column of the sparse adjacency matrix, and value represents the number of non-zero elements of that column in the sparse adjacency matrix; when the highest two bits of the index are "00", the remaining bits of the index represent a row index, and value represents the corresponding non-zero element value in the sparse adjacency matrix.
For convenience of explanation, the CSCI compression format is illustrated below, taking the simple graph shown in fig. 1B as an example; the graph includes six vertices A, B, C, D, E and F, and the weight of each edge is indicated in fig. 1B. The sparse adjacency matrix M corresponding to fig. 1B is represented as follows: since fig. 1B includes six vertices, the adjacency matrix is a 6 × 6 matrix, the row and column indexes of the matrix start from 1, and a dash indicates that there is no connection between the two corresponding vertices, i.e., the weight is 0;

        | - 1 - - 2 - |
        | - - - 3 - - |
    M = | 3 - - - - 1 |
        | 2 - - - 4 - |
        | - - 1 - - 3 |
        | - - - - - - |
Column 1 compressed: this column is not the last column of the matrix and has 2 non-zero elements, located in row 3 and row 4 respectively, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0001,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0011,0000_0000_0000_0011)
(0000_0000_0000_0000_0000_0000_0000_0100,0000_0000_0000_0010)
Column 2 compressed: this column is not the last column of the matrix and has 1 non-zero element, located in row 1, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0010,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0001,0000_0000_0000_0001)
Column 3 compressed: this column is not the last column of the matrix and has 1 non-zero element, located in row 5, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0011,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0101,0000_0000_0000_0001)
Column 4 compressed: this column is not the last column of the matrix and has 1 non-zero element, located in row 2, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0100,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0010,0000_0000_0000_0011)
Column 5 compressed: this column is not the last column of the matrix and has 2 non-zero elements, located in row 1 and row 4 respectively, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0101,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0001,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0100,0000_0000_0000_0100)
Column 6 compressed: this column is the last column of the matrix and has 2 non-zero elements, located in row 3 and row 5 respectively, so the column is compressed as:
(1000_0000_0000_0000_0000_0000_0000_0110,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0011,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0101,0000_0000_0000_0011)
After each column is compressed, the compressed columns are stored sequentially in the memory in column-major order.
The compression process can be described as follows:
The sparse adjacency matrix is processed independently column by column starting from the first column; when each column is compressed:
1) The number of non-zero elements of the column is counted to generate the column identifier data pair: if the column is not the last column, the column identifier's index[31:30] is "01", otherwise it is "10"; the column identifier's index[29:0] represents the column index, and its value[15:0] represents the number of non-zero elements of the column;
2) The data pairs are generated by taking all the non-zero elements of the column in row order, where index[31:30] is "00", index[29:0] represents the row index of each non-zero element, and value[15:0] represents the numerical value of each non-zero element.
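For illustration only, steps 1) and 2) can be sketched in software as follows (a minimal Python sketch assuming the matrix is given as a list of rows with 0 meaning "no edge", and 1-based row/column indexes as in the fig. 1B example; names are illustrative):

    def compress_csci(matrix):
        """Compress a sparse adjacency matrix into a list of CSCI (index, value) pairs."""
        pairs = []
        ncols = len(matrix[0])
        for c in range(ncols):
            nonzeros = [(r + 1, row[c]) for r, row in enumerate(matrix) if row[c] != 0]
            tag = 0b10 if c == ncols - 1 else 0b01                # "10" marks the last column
            pairs.append(((tag << 30) | (c + 1), len(nonzeros)))  # column identifier ioc
            pairs.extend(nonzeros)                                # pairs with index[31:30] == "00"
        return pairs

Applied to the matrix M above, the first generated pair is (0x40000001, 2), i.e., exactly the binary column identifier listed for column 1.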
Referring to fig. 2, the control circuit (CTR) is composed of two parts: a host interface component (host_IF) and a control logic component (Ctr_Logic).
The host interface component (host_IF) receives and temporarily stores the application type sent by the host and the vertex index corresponding to that application type.
The application types in this embodiment include: application type one for a Breadth First Search (BFS) application, and application type two for a Single Source Shortest Path (SSSP) application.
The vertex index for the breadth-first search (BFS) application is the root vertex index, and the vertex index for the single-source shortest path (SSSP) application is the source vertex index.
After receiving the conversion ready indication signal sent by the preprocessing circuit, the control logic component (Ctr_Logic) sends the vertex index to the data access unit (DAU), sends the application type to the mixed granularity processing unit (MGP) and the result generation unit (RGU), and starts the accelerator.
Referring to fig. 3, the data access unit (DAU) is composed of a user logic component (UI), a column identification register (ioc_ram), and an address calculation module (addr_cal).
The user logic component mainly performs three functions:
1) Reading the column identifiers from the memory and sending them into the column identification register (ioc_ram) for temporary storage;
2) Reading the data corresponding to the corresponding active vertex from the memory according to the address calculated by the address calculation module (addr_cal), the amount of data to read being determined by the value in that vertex's column identifier data pair;
the address calculation module can calculate according to the following formula one, and in addition, the address calculation process can be completed by the index of the active vertex, for this reason, the address calculated for the active vertex, and 2) the corresponding active vertex in the active vertex corresponds to the vertex to which the address belongs.
The active vertex is the vertex that the algorithm has been updated during each iteration, and is used as the active vertex for the next iteration. The root/source vertices are initially designated, which is the first round of active vertices, and subsequent calculations based on the root/source vertices result in each round of active vertices.
3) Sending the read vertex data to the scheduler (SCD), stopping reading data from the memory according to the pause-read signal sent by the scheduler (SCD), and meanwhile saving the current state so that reading can resume after the pause-read signal sent by the scheduler becomes invalid.
The column identification register (ioc_ram) is used for temporarily storing the column identifiers of the graph data in the CSCI format.
The address calculation module (addr_cal) calculates the physical address PhyAddr in the memory of the data corresponding to the current active vertex i, according to the vertex index input by the control circuit or by the result generation unit, in combination with the per-column non-zero element counts provided by the column identification register and the number of bytes RowSize stored per memory row. Assuming the base address of the CSCI-format graph data in the memory is BaseAddr, PhyAddr can be expressed as:
Formula one: PhyAddr = BaseAddr + (nnz_c_0 + nnz_c_1 + ... + nnz_c_i)/RowSize.
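For illustration only, formula one can be sketched as follows (a minimal Python sketch assuming nnz_c is the list of per-column non-zero counts copied from the column identifiers, and that the division is the integer division implied by row-aligned addressing):

    def phys_addr(base_addr, nnz_c, i, row_size):
        """Formula one: PhyAddr = BaseAddr + (nnz_c_0 + ... + nnz_c_i) / RowSize."""
        return base_addr + sum(nnz_c[:i + 1]) // row_size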
Referring to fig. 4, the scheduler (SCD) is composed of a buffer allocation module (buf_assign), a task scheduling module (task_sch) and a double-buffer module (double_buffer).
The buffer allocation module (buf_assign) parses the column identifier of each column of CSCI-format graph data sent from the data access unit to obtain the number of data pairs to be processed for that column, i.e., the value of the column identifier data pair; it sends the column of data to be processed to the double-buffer module (double_buffer) according to the buffer state information sent by the double-buffer module, and sends a stop-read signal to the data access unit (DAU) when all buffers are occupied, i.e., full;
The task scheduling module (task_sch) sends unscheduled data (data not yet sent to the mixed granularity processing unit) in all buffers to a processing element that is idle and whose computing capacity meets the requirement, according to the processing element state signals sent by the mixed granularity processing unit (MGP) and the buffer state information sent by the double-buffer module (double_buffer);
the double buffer module (double _ buffer) is composed of 16 sets of front and rear double buffers with different temporary storage capacities, and the capacity of each buffer is described below.
In the buffer area names, f and b represent a front buffer area and a rear buffer area, 0-7 represent the buffer areas from 0 to 7, and the meanings of 8-11 and 12-13 are similar.
When the double-buffer module receives a column of graph data in the CSCI format, it determines which buffer to use according to the number of data pairs in that column. Each buffer is provided as a front/back pair used alternately in ping-pong fashion. Initially, all buffer states are "empty" and data is stored into the front buffers, i.e., buf_f; when all front buffers of a given capacity are occupied, newly received data within that capacity range is stored into the back buffers, i.e., buf_b. If both the small-capacity front and back buffers are occupied and a larger-capacity buffer is idle, the smaller amount of data may be stored in the larger-capacity buffer.
For example, when the number of data pairs does not exceed 64, the data is stored into buf0_f to buf7_f in order: if buf0_f holds data that has not been read out, buf1_f is used; if buf1_f is occupied, buf2_f is used, and so on. When buf0_f to buf7_f are completely occupied, the data is stored into buf0_b to buf7_b in order; and if buf0_f to buf7_f and buf0_b to buf7_b are all occupied, data whose number of pairs does not exceed 64 can be temporarily stored in buf8_f to buf11_f, and so on.
Because the data bandwidth of the memory is limited, when a column of CSCI-format graph data exceeds the memory data bandwidth, the data must be stored into the buffer over several transfers. Once the whole column has been buffered, the buffer state becomes "full" and the task scheduling module (task_sch) is informed that the column can be scheduled; after the data has been scheduled, the buffer state becomes "empty" and is sent to the buffer allocation module (buf_assign). When the number of data pairs in a column exceeds 1024, i.e., the maximum buffer capacity, the column is processed in batches.
buf0_f to buf7_f, buf0_b to buf7_b: 64 data pairs;
buf8_f to buf11_f, buf8_b to buf11_b: 128 data pairs;
buf12_f to buf13_f, buf12_b to buf13_b: 256 data pairs;
buf14_f, buf14_b: 512 data pairs;
buf15_f, buf15_b: 1024 data pairs.
Referring to fig. 5, the mixed granularity processing unit (MGP) is composed of an auxiliary circuit module (aux_cell) and a processing element array (PEA).
The auxiliary circuit module (aux_cell) sends the active vertex data pair input by the result generation unit (RGU), together with the corresponding CSCI-format graph data input by the scheduler (SCD), to a corresponding idle processing element in the processing element array according to the processing element states.
The above-mentioned active vertex data pair can be understood as follows: the BFS/SSSP calculation produces a value for each vertex representing the depth or distance of that vertex in the graph, and the same holds for the intermediate results, except that the value may be updated during subsequent iterations; hence the name active vertex data pair.
The processing element array (PEA) is composed of 16 processing elements (PE) of different capacities that can work in parallel; the processing capacities are listed below (the data pairs counted here do not include column identifier data pairs).
After receiving the active vertex data pair and the CSCI-format graph data input by the auxiliary circuit module (aux_cell), a processing element (PE) performs the calculation on them according to the application type sent by the control circuit (CTR).
When the application type is breadth-first search (BFS), the processing element (PE) updates the value of each data pair of the CSCI graph data with the value of the active vertex data pair plus 1;
when the application type is single-source shortest path (SSSP), the processing element (PE) adds the value of the active vertex data pair to the value of each data pair of the CSCI graph data, updating the value of each data pair of the CSCI-format graph data with the sum;
the calculation result, which contains at most as many data pairs as each processing element (PE) can process simultaneously, is output to the result generation unit (RGU).
PE 0-7, 64 data pairs can be processed simultaneously;
PE 8-11, which can process 128 data pairs at the same time;
PE 12-13, which can process 256 data pairs at the same time;
PE14, can process 512 data pairs at the same time;
and PE15, can process 1024 data pairs simultaneously.
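For illustration only, a processing element's update can be sketched as follows (a minimal Python sketch; the tuple layout is an illustrative assumption, and column_pairs stands for the non-identifier (row index, value) pairs of the active vertex's column):

    def pe_compute(app_type, active_pair, column_pairs):
        """One PE step: produce candidate (row index, depth/distance) data pairs."""
        _, active_value = active_pair
        if app_type == 'BFS':
            # depth candidate for every neighbour: depth(active vertex) + 1
            return [(row, active_value + 1) for row, _ in column_pairs]
        # 'SSSP': distance candidate: distance(active vertex) + edge weight
        return [(row, active_value + w) for row, w in column_pairs]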
Referring to fig. 6, the result generation unit (RGU) is composed of an operation module (OPC), a comparator (CMP) and an on-chip result register (cur_rlt).
Referring to figs. 7 and 6, the operation module (OPC) is an 8-way 4-stage pipeline tree composed of 15 operation units (op_cells).
The calculation capacity of each op _ cell is the same as the maximum number of data pairs contained in the input data;
Each op_cell processes the intermediate data input by the mixed granularity processing unit (MGP) according to the application type input by the control circuit (CTR). For both breadth-first search (BFS) and single-source shortest path (SSSP) the calculation is a comparison operation: the values of the two input data pairs with the same row index are compared, and the smaller value is output to the next stage as the new value for that row index, until the last-stage op_cell outputs the new value to the comparator (CMP).
The comparator (CMP) processes the data input by the operation module (OPC) pair by pair: for the row index of each data pair, it reads the previous value for that row index from the on-chip result register (cur_rlt) and compares it with the input current value. If the current value is not smaller than the previous value, no operation is performed and the next row index is processed directly; if the current value is smaller, the value for that row index in the on-chip result register (cur_rlt) is updated, the vertex corresponding to the row index is set as an active vertex, the row index is output to the data access unit (DAU), and the (row index, value) data pair is output to the mixed granularity processing unit (MGP). The on-chip result register (cur_rlt) is used to register the depth of each vertex (for breadth-first search, BFS) or the distance of each vertex (for single-source shortest path, SSSP).
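For illustration only, the op_cell/comparator path can be sketched as follows (a minimal Python sketch; the dictionary layout is an illustrative assumption, and cur_rlt stands for the on-chip result register):

    def op_cell(a, b):
        """One op_cell: merge two {row index: value} inputs, keeping the smaller value."""
        merged = dict(a)
        for row, value in b.items():
            merged[row] = min(value, merged.get(row, value))
        return merged

    def comparator(candidates, cur_rlt):
        """Commit candidates that improve on the registered depth/distance; return new actives."""
        newly_active = []
        for row, value in candidates.items():
            if value < cur_rlt.get(row, float('inf')):
                cur_rlt[row] = value           # update the on-chip result register
                newly_active.append(row)       # this vertex is active for the next round
        return newly_active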
The invention has already been adopted in the "mixed granularity parallel graph computation accelerator" project; practical verification shows that the circuit functions as expected and works reliably, achieving the aim of the invention.
It should be understood that the above description of specific embodiments of the present invention is only for the purpose of illustrating its technical solutions and features, and is intended to enable those skilled in the art to understand and implement the present invention, but the present invention is not limited to the above specific embodiments. It is intended to cover in the appended claims all changes and modifications that are within the scope of this invention.