
CN109919826B - Graph data compression method for graph computation accelerator and graph computation accelerator - Google Patents


Info

Publication number
CN109919826B
CN109919826B
Authority
CN
China
Prior art keywords
data
graph
column
index
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910107925.9A
Other languages
Chinese (zh)
Other versions
CN109919826A (en)
Inventor
Deng Junyong
Lizy K. John
Song Shuang
Wu Qinzhe
Yang Bowen
Tian Pu
Zhao Yidi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
University of Texas System
Original Assignee
Xian University of Posts and Telecommunications
University of Texas at Austin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications, University of Texas at Austin filed Critical Xian University of Posts and Telecommunications
Priority to CN201910107925.9A priority Critical patent/CN109919826B/en
Publication of CN109919826A publication Critical patent/CN109919826A/en
Application granted granted Critical
Publication of CN109919826B publication Critical patent/CN109919826B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a graph data compression method for a graph computation accelerator, and the graph computation accelerator itself. The method comprises the following steps: S1, a preprocessing circuit of the graph computation accelerator converts the graph data to be processed, represented by a sparse adjacency matrix, into graph data in the independent sparse column compression (CSCI) format, where each independently compressed column of the graph data comprises a column identification data pair and non-zero element data pairs, each data pair comprises an index and a value, and the highest two bits of the index indicate the meanings of the remaining index bits and of the value; S2, the preprocessing circuit stores the converted CSCI-format graph data in a memory of the graph computation accelerator. The compression method can improve the parallelism and the energy efficiency of the graph computation accelerator.

Description

Graph data compression method for graph computation accelerator and graph computation accelerator
Technical Field
The present invention relates to data compression technologies, and in particular, to a graph data compression method for a graph computation accelerator and a graph computation accelerator.
Background
With the rise of new internet applications such as social networks and the popularization of various electronic devices, graph computation, especially large-scale graph computation, has increasingly become a research hotspot in academia and industry, and the research and development of graph computation accelerators, from the perspectives of technology, application, independent intellectual property rights and the like, is imperative.
Currently, a variety of graph computation accelerators are designed in the industry, and in these graph computation accelerators, efficient compression formats of graph data need to be considered to improve the parallelism and energy efficiency of the computation.
Disclosure of Invention
The invention aims to provide a graph data compression method for a graph computation accelerator and the graph computation accelerator, which can improve the parallelism and the energy efficiency of the graph computation accelerator.
In order to achieve the purpose, the invention adopts the main technical scheme that:
the invention provides a graph data compression method for a graph computation accelerator, which comprises the following steps:
s1, a preprocessing circuit of a graph computation accelerator converts graph data to be processed and represented by an adjacent sparse matrix into graph data in an independent sparse column compression CSCI format, each column of the independently compressed graph data comprises a column identification data pair and a non-zero element data pair, each data pair comprises an index and a value, the highest two bits of the index indicate the rest bits of the index and the value,
s2, the preprocessing circuit of the graph calculation accelerator stores the converted graph data in the CSCI format in a memory of the graph calculation accelerator.
As a further improvement of the present invention, the step S1 includes:
independently compressing the graph data represented by the sparse adjacency matrix into individual data pairs according to columns;
each data pair structure includes: index and value;
the data pair with the highest two index bits of "01" or "10" is a column identifier ioc;
the data pairs following the data pairs identified as columns are the data pairs corresponding to non-zero elements of all rows of the column.
As a further improvement of the present invention, when the highest two bits of the index are "01", the remaining bits of the index represent a column index, and value represents the number of non-zero elements of that column in the sparse adjacency matrix;
when the highest two bits of the index are "10", the remaining bits of the index represent a column index, the column is the last column of the sparse adjacency matrix, and value represents the number of non-zero elements of that column in the sparse adjacency matrix;
when the highest two bits of the index are "00", the remaining bits of the index represent a row index, and value represents the corresponding non-zero element value in the sparse adjacency matrix.
As a further improvement of the present invention, the numbers of bits of index and value are determined according to the scale of the sparse adjacency matrix data.
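For illustration, the data-pair layout can be modeled in software as below. This is a sketch of ours, not part of the claimed circuit; it assumes the 32-bit index and 16-bit value widths used in the embodiment later in this description, and all function names are ours.

```python
# Software model of a CSCI data pair (assumed widths: 32-bit index, 16-bit value).
FLAG_NONZERO  = 0b00  # remaining index bits = row index, value = element value
FLAG_COLUMN   = 0b01  # remaining index bits = column index, value = non-zero count
FLAG_LAST_COL = 0b10  # as FLAG_COLUMN, but the column is the matrix's last column

def pack_pair(flag: int, idx: int, value: int) -> tuple[int, int]:
    """Pack a CSCI data pair: the highest two bits of index carry the flag."""
    assert flag in (FLAG_NONZERO, FLAG_COLUMN, FLAG_LAST_COL)
    assert 0 <= idx < (1 << 30) and 0 <= value < (1 << 16)
    return (flag << 30) | idx, value

def unpack_pair(index: int, value: int) -> tuple[int, int, int]:
    """Return (flag, remaining index bits, value) of a CSCI data pair."""
    return (index >> 30) & 0b11, index & ((1 << 30) - 1), value
```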
In another aspect, the present invention also provides a graph computation accelerator, comprising a preprocessing circuit and a memory;
the preprocessing circuit performs a conversion process on the neighboring sparse matrix data according to the compression method of any one of the preceding claims 1 to 4.
As a further improvement of the invention, the method also comprises the following steps:
the device comprises a control circuit, a data access unit, a scheduler, a mixed granularity processing unit and a result generation unit;
the preprocessing circuit is further used for storing the column identification copy in the CSCI into the memory;
the control circuit is used for receiving a conversion-ready indication signal sent by the preprocessing circuit after the preprocessing circuit finishes storing into the memory, controlling the operation of the data access unit, the mixed granularity processing unit and the result generation unit according to the graph computation application type sent by the host, and sending the root vertex index of application type I or the source vertex index of application type II sent by the host to the data access unit;
the data access unit is used for reading the graph data and the column identification of the CSCI from the memory, calculating the physical address of the appointed vertex in the memory according to the root vertex index, the source vertex index or the active vertex index transmitted by the result generation unit so as to access the data, and transmitting the read data to the scheduler;
the scheduler is used for temporarily storing the number of non-zero element data pairs indicated by the column identifications in the CSCI and distributing the temporarily stored data to the processing elements in the mixed granularity processing unit for processing according to the state signals of the processing elements in the mixed granularity processing unit;
the mixed granularity processing unit is used for carrying out parallel processing on the data temporarily stored in the scheduler according to the application type in the control circuit and the active vertex data of the result generating unit and transmitting the processed intermediate data to the result generating unit;
and the result generating unit is used for processing the intermediate data according to the application type in the control circuit, sending the active vertex index of the processing process to the data access unit and storing the processed final result.
As a further improvement of the present invention, the control circuit includes: a host interface component and a control logic component;
the host interface component is used for receiving an application type, a root vertex index of the application type I and a source vertex index of the application type II which are sent by a host;
the control logic component is used for receiving a conversion ready indication signal sent by the preprocessing circuit, sending the root vertex index or the source vertex index to the data access unit, sending the application type to the mixed granularity processing unit and the result generating unit, and starting each module in the graph computation accelerator to start working;
the application type is a breadth-first search application BFS type, and the application type is a single-source shortest path application SSSP type.
As a further improvement of the present invention, the data access unit includes: a user logic component, an address calculation module and a column identification register;
the column identification register is used for storing the column identifications of the graph data in the CSCI;
the address calculation module is used for calculating the physical address in the memory of the data corresponding to the current active vertex i, according to the vertex index sent by the control circuit or input by the result generation unit, in combination with the per-column non-zero element counts in the column identification register and the amount of data stored per memory row;
the user logic component is used for reading the column identifications from the memory and temporarily storing them in the column identification register; reading the data corresponding to the corresponding active vertex from the memory according to the address calculated by the address calculation module, and sending the read data to the scheduler;
after receiving the pause-reading signal sent by the scheduler, stopping reading data from the memory;
the user logic component is also used for resuming reading after the pause-reading signal sent by the scheduler is deasserted.
As a further improvement of the present invention, the scheduler comprises: the system comprises a buffer area distribution module, a task scheduling module and a double-buffer area module;
the buffer area distribution module is used for analyzing a data pair corresponding to the column identifier of the column data transmitted from the data access unit, transmitting the graph data corresponding to the column identifier to the double-buffer area module according to the buffer area state information transmitted by the double-buffer area module, and transmitting a reading stop signal to the data access unit when all the buffer areas in the double-buffer area module are occupied;
the task scheduling module is used for sending unscheduled data in all the buffers into a processing element which is idle and has the calculation capacity meeting the requirement for processing according to the processing element state signal transmitted by the mixed granularity processing unit and the buffer state information transmitted by the double-buffer module;
the double-buffer module comprises: a plurality of groups of front and back double buffer areas with different temporary storage capacities;
the double-buffer module is used for informing the task scheduling module to schedule the graph data cached by the buffer when the states of all the buffers are full, setting the state of the buffer after the data scheduling to be empty, and sending the state of the buffer to the buffer allocation module.
As a further improvement of the invention, the mixed granularity processing unit comprises: an auxiliary circuit module and a processing element array;
the auxiliary circuit module is used for transmitting the active vertex data pair input by the result generating unit and the corresponding CSCI input by the scheduler to the corresponding idle processing element in the processing element array according to the state of each processing element in the processing element array;
the processing element array consists of a plurality of processing elements PE with different capacities, and the plurality of processing elements work in parallel;
and after each processing element receives the active vertex data pair and the CSCI input by the auxiliary circuit module, calculating the active vertex data pair and the CSCI according to the application type transmitted by the control circuit.
As a further improvement of the present invention, each processing element is specifically configured to:
when the application type transmitted by the control circuit is breadth-first search, the value of each data pair in the CSCI is updated to the value of the active vertex data pair plus 1;
when the application type is single-source shortest path, the value of the active vertex data pair is added to the value of each data pair in the CSCI to update the value of each data pair in the CSCI;
the calculation result of a processing element is output to the result generation unit, and the maximum number of data pairs contained in the calculation result is the same as the number of data pairs processed simultaneously by that processing element.
As a further improvement of the invention, the result generation unit comprises: an operation module, a comparator and an on-chip result register;
the operation module comprises: an 8-path 4-stage pipeline tree consisting of 15 operation units;
the calculation capacity of each operation unit is the same as the maximum number of data pairs contained in the input data;
each operation unit is used for calculating the intermediate data input by the mixed granularity processing unit according to the application type input by the control circuit;
the comparator is used for processing the data input by the operation module pair by pair: for the row index of each data pair, it reads the previous value corresponding to that row index from the on-chip result register and compares it with the input current value; if the current value is not smaller than the previous value, no operation is performed and the next row index is processed directly; if the current value is smaller, the value of that row index in the on-chip result register is updated, the vertex corresponding to the row index is set as an active vertex, the row index is output to the data access unit, and the (row index, value) data pair is output to the mixed granularity processing unit;
the on-chip result register is used for temporarily storing the depth/distance of each vertex.
As a further improvement of the present invention, each of the operation units is specifically configured to: the calculations of application type I and application type II are both comparison operations, namely, the values of two input data pairs with the same row index are compared, and the smaller value is taken as the new value corresponding to that row index and output to the next stage, until it is output from the last-stage op_cell and then input to the comparator.
As a further improvement of the present invention, the address calculation module is further configured to determine the physical address PhyAddr from the base address BaseAddr of the CSCI data in the memory according to formula one;
Formula one: PhyAddr = BaseAddr + (nnz_c0 + nnz_c1 + ... + nnz_ci) / RowSize;
nnz_ci represents the number of non-zero elements of the corresponding column in the sparse adjacency matrix, where i is the column index;
RowSize indicates the number of bytes of data stored per memory row.
The invention has the beneficial effects that:
the graph data compressed by the graph data compression method is applied to the graph computation accelerator, so that the graph computation accelerator efficiently realizes the BFS application, the SSSP application and the like in the graph computation, the effective bandwidth and the parallelism are improved, and the processing process is accelerated.
Drawings
FIG. 1A is a schematic diagram of a graph computation accelerator according to an embodiment of the present invention;
FIG. 1B is a schematic diagram of a graph data compression process for a graph computation accelerator according to the present invention;
FIG. 2 is a schematic diagram of the control circuit of FIG. 1A;
FIG. 3 is a circuit configuration diagram of the data access unit in FIG. 1A;
FIG. 4 is a circuit configuration diagram of the scheduler of FIG. 1A;
FIG. 5 is a circuit configuration diagram of the mixed-granularity processing unit of FIG. 1A;
FIG. 6 is a circuit configuration diagram of the result generation unit in FIG. 1A;
FIG. 7 is a circuit configuration diagram of the operation module of FIG. 6.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
At present, the management of large-scale graph data can adopt various data models, which are divided into simple graph models and hypergraph models according to the number of vertices one edge can connect. The invention is directed at the simple graph model, i.e., one edge can only connect two vertices, and loops are allowed. Real-world graphs usually have an average degree, i.e., the ratio of the number of edges to the number of vertices, of only a few to a few hundred, which is very sparse compared with their scale of tens of millions or even hundreds of millions of vertices, and their degrees follow a power-law distribution.
The simple graph model may be represented in the form of a sparse adjacency matrix. Because of its large scale, the graph data is stored in the memory in a compressed form, and common compressed forms include CSC (Compressed Sparse Column), CSR (Compressed Sparse Row), COO (Coordinate List), DCSC (Doubly Compressed Sparse Column), and the CSCI (Compressed Sparse Column Independent) format proposed in the present invention.
Breadth-first search BFS is a basic graph search algorithm and is also the basis of many important graph algorithms. Breadth-first search BFS starts iteratively searching all reachable vertices in a given graph from a given vertex, called the root, and calculates the depth, i.e., the minimum number of edges, from the root vertex to all reachable vertices. At initialization, the depth of the root vertex is set to 0 and marked as active, and the depths of all other vertices are set to infinity. At the t-th iteration, the depth of the vertex v adjacent to the active vertex is calculated as follows. If the depth of a vertex is updated from infinity to t +1, the vertex is marked as active and used for the next iteration. This is repeated until the search is complete.
depth(v)=min(depth(v),t+1)
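A minimal software sketch of this iteration (illustrative only, not the accelerator's datapath; it assumes the graph is given as adjacency lists with every vertex present as a key, and the function name is ours):

```python
def bfs_depths(adj: dict[int, list[int]], root: int) -> dict[int, float]:
    """Level-synchronous BFS: depth(v) = min(depth(v), t + 1)."""
    depth = {v: float("inf") for v in adj}
    depth[root] = 0          # root depth is 0; root starts as the active vertex
    active, t = [root], 0
    while active:
        nxt = []
        for u in active:
            for v in adj[u]:
                if depth[v] == float("inf"):  # updated from infinity to t + 1
                    depth[v] = t + 1
                    nxt.append(v)             # marked active for the next iteration
        active, t = nxt, t + 1
    return depth
```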
The single-source shortest path SSSP is used to compute the shortest-path distances from a specified source vertex to all reachable vertices in a given graph. At initialization, the distance of the source vertex is set to 0 and the vertex is marked as active, and the distances of all other vertices are set to infinity. At the t-th iteration, assuming the weight of the edge from vertex u to vertex v is w(u, v), the shortest-path distance from the source vertex to vertex v is calculated as follows. If the distance of a vertex is updated, the vertex is marked as active and used for the next iteration. This is repeated until all reachable vertices have been processed.
distance(v)=min(distance(v),distance(u)+w(u,v))。
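The corresponding sketch for SSSP (again illustrative and ours; adjacency lists here carry (neighbor, weight) tuples, and non-negative weights are assumed):

```python
def sssp_distances(adj: dict[int, list[tuple[int, float]]], src: int) -> dict[int, float]:
    """Iterative relaxation: distance(v) = min(distance(v), distance(u) + w(u, v))."""
    dist = {v: float("inf") for v in adj}
    dist[src] = 0.0           # source distance is 0; source starts as active
    active = {src}
    while active:
        nxt = set()
        for u in active:
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:  # relaxed: v becomes active
                    dist[v] = dist[u] + w
                    nxt.add(v)
        active = nxt
    return dist
```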
The graph computation accelerator is a circuit structure for graph computation that adopts independent sparse column compression and a mixed granularity processing unit for parallel acceleration, and it supports the two graph computation applications BFS and SSSP, among others. The structure and operation of the graph computation accelerator of the present invention will be described in detail below with reference to figs. 1A to 7.
As shown in fig. 1A, the graph computation accelerator structure of the embodiment of the present invention includes a preprocessing circuit (CSCIU, Compressed Sparse Column Independent Unit), a control circuit (CTR, ConTRoller), a data access unit (DAU, Data Access Unit), a scheduler (SCD, SCheDuler), a mixed granularity processing unit (MGP, Mixed-Granularity Processing unit), and a result generation unit (RGU, Result Generation Unit).
Wherein the memory in the graph computation accelerator architecture may be understood as a general purpose memory for storing graph computation related data.
The preprocessing circuit (CSCIU) converts the input sparse adjacency matrix graph data into the independent sparse column compression (CSCI) format and stores it into the memory; meanwhile, a copy of the column identifications in the CSCI is stored in the memory, i.e., a copy of the number of non-zero elements corresponding to each column in the column identifications (ioc) is stored in the memory.
The input to the preprocessing circuit may be the original input from an external structure. In addition, since there are various simple graph representations, the preprocessing circuit converts graph data given in adjacency matrix form.
In this embodiment, the index of a data pair (index, value) in the CSCI format after graph data compression is represented by 32 bits and the value by 16 bits; the specific meanings of index and value are shown in table 1, where the data pair whose index[31:30] is "01" or "10" is the column identifier ioc. The copies of the non-zero element count corresponding to each column in the column identifications are stored in the memory column by column.
TABLE 1 description of data pairs in the CSCI Format
index[31:30]    meaning of index[29:0]                      meaning of value[15:0]
"01"            column index                                number of non-zero elements of the column
"10"            column index (last column of the matrix)    number of non-zero elements of the column
"00"            row index of a non-zero element             value of the non-zero element
After the storage is finished, the preprocessing circuit sends a conversion-ready indication signal to the control circuit.
In the present embodiment, the independent sparse column compression CSCI format represents the sparse adjacency matrix of the graph data as (index, value) data pairs that are compressed independently column by column.
For convenience of explanation, it is assumed that index is represented by 32 bits and value by 16 bits; in a specific application, the numbers of bits representing index and value can be determined according to the actual scale of the graph data. The specific meanings of index and value are shown in table 1, where the data pair whose index[31:30] is "01" or "10" is the column identification ioc.
The data pair with the highest two bits of the index being "01" or "10" is a column identifier ioc (indicator of column), and the data pair after each column identifier data pair is a data pair corresponding to non-zero elements of all rows of the column;
when the highest two bits of the index are "01", the remaining bits of the index represent a column index, and value represents the number of non-zero elements of that column in the sparse adjacency matrix;
when the highest two bits of the index are "10", the remaining bits of the index represent a column index, the column is the last column of the sparse adjacency matrix, and value represents the number of non-zero elements of that column in the sparse adjacency matrix;
when the highest two bits of the index are "00", the remaining bits of the index represent a row index, and value represents the corresponding non-zero element value in the sparse adjacency matrix.
For convenience of explanation, the CSCI compression format is illustrated below taking the simple graph shown in fig. 1B as an example; the graph includes six vertices A, B, C, D, E, and F, and the weight of each edge is indicated in fig. 1B. The sparse adjacency matrix M corresponding to fig. 1B is shown below. Since fig. 1B includes six vertices, the adjacency matrix is a 6 × 6 matrix, the row and column indexes of the matrix start from 1, and a short horizontal line indicates that there is no connection between the two corresponding vertices, i.e., the weight is 0;
M (rows 1-6 from top to bottom, columns 1-6 from left to right):
-  1  -  -  2  -
-  -  -  3  -  -
3  -  -  -  -  1
2  -  -  -  4  -
-  -  1  -  -  3
-  -  -  -  -  -
Column 1 compressed: this column is not the last column of the matrix and has 2 non-zero elements, located in row 3 and row 4 respectively, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0001,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0011,0000_0000_0000_0011)
(0000_0000_0000_0000_0000_0000_0000_0100,0000_0000_0000_0010)
Column 2 compressed: this column is not the last column of the matrix and has 1 non-zero element, located in row 1, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0010,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0001,0000_0000_0000_0001)
Column 3 compressed: this column is not the last column of the matrix and has 1 non-zero element, located in row 5, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0011,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0101,0000_0000_0000_0001)
Column 4 compressed: this column is not the last column of the matrix and has 1 non-zero element, located in row 2, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0100,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0010,0000_0000_0000_0011)
Column 5 compressed: this column is not the last column of the matrix and has 2 non-zero elements, located in row 1 and row 4 respectively, so the column is compressed as:
(0100_0000_0000_0000_0000_0000_0000_0101,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0001,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0100,0000_0000_0000_0100)
Column 6 compressed: this column is the last column of the matrix and has 2 non-zero elements, located in row 3 and row 5 respectively, so the column is compressed as:
(1000_0000_0000_0000_0000_0000_0000_0110,0000_0000_0000_0010)
(0000_0000_0000_0000_0000_0000_0000_0011,0000_0000_0000_0001)
(0000_0000_0000_0000_0000_0000_0000_0101,0000_0000_0000_0011)
After each column is compressed, the compressed columns are stored in the memory sequentially in column-major order.
The compression process can be described as follows (a software sketch follows the two steps below):
The sparse adjacency matrix is processed independently column by column starting from the first column, and when each column is compressed:
1) The number of non-zero elements of the column is counted and a column identification data pair is generated; if the column is not the last column, the column identification index[31:30] is "01", otherwise it is "10"; the column identification index[29:0] represents the column index, and the column identification value[15:0] represents the number of non-zero elements of the column;
2) Data pairs are generated by taking all the non-zero elements of the column sequentially by row, where index[31:30] is "00", index[29:0] represents the row index of each non-zero element, and value[15:0] represents the numerical value of each non-zero element.
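A software sketch of this column-by-column compression (ours, for illustration; the patent performs it in the preprocessing circuit) reproduces the data pairs listed above for the example matrix:

```python
def csci_compress(matrix: list[list[int]]) -> list[tuple[int, int]]:
    """Compress a dense adjacency matrix into CSCI data pairs, column by column.
    Row and column indexes start from 1, as in the example above; 0 means no edge."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    pairs = []
    for c in range(n_cols):
        nonzeros = [(r + 1, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        flag = 0b10 if c == n_cols - 1 else 0b01               # "10" marks the last column
        pairs.append(((flag << 30) | (c + 1), len(nonzeros)))  # column identifier ioc
        for row, val in nonzeros:                              # non-zero element data pairs
            pairs.append((row, val))                           # index[31:30] = "00"
    return pairs

# The example matrix M from fig. 1B (0 = no edge):
M = [[0, 1, 0, 0, 2, 0],
     [0, 0, 0, 3, 0, 0],
     [3, 0, 0, 0, 0, 1],
     [2, 0, 0, 0, 4, 0],
     [0, 0, 1, 0, 0, 3],
     [0, 0, 0, 0, 0, 0]]
# The first three pairs of csci_compress(M) match column 1 above:
# (0x40000001, 2), (0x00000003, 3), (0x00000004, 2)
```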
Referring to fig. 2, the control circuit (CTR) consists of two parts: a host interface component (host_IF) and a control logic component (Ctr_Logic).
The host interface component (host_IF) receives and temporarily stores the application type sent by the host and the vertex index corresponding to the application type.
The application types in this embodiment include: application type I for the breadth-first search (BFS) application, and application type II for the single-source shortest path (SSSP) application.
The vertex index of the breadth-first search (BFS) application is a root vertex index, and the vertex index of the single-source shortest path (SSSP) application is a source vertex index.
After receiving the conversion-ready indication signal sent by the preprocessing circuit, the control logic component (Ctr_Logic) sends the vertex index to the data access unit (DAU), sends the application type to the mixed granularity processing unit (MGP) and the result generation unit (RGU), and starts the accelerator.
Referring to fig. 3, the data access unit (DAU) consists of a user logic component (UI), a column identification register (ioc_ram), and an address calculation module (addr_cal).
The user logic component mainly performs three functions:
1) Reading the column identifications from the memory and sending them into the column identification register (ioc_ram) for temporary storage;
2) Reading the data corresponding to the corresponding active vertex from the memory according to the address calculated by the address calculation module (addr_cal), where the number of data items to read is determined by the value in that vertex's column identification data pair;
The address calculation module calculates according to formula one below. The address calculation is driven by the index of the active vertex, so each calculated address belongs to an active vertex, and the "corresponding active vertex" in 2) is the vertex to which the address belongs.
An active vertex is a vertex whose value has been updated by the algorithm during an iteration; it is used as an active vertex for the next iteration. The root/source vertex is designated initially and forms the first round of active vertices, and subsequent calculations based on the root/source vertex produce each round's active vertices.
3) Sending the read vertex data to the scheduler (SCD), stopping reading data from the memory upon the pause-reading signal sent by the scheduler (SCD), and meanwhile saving the current state so that reading can resume after the pause-reading signal sent by the scheduler is deasserted.
The column identification register (ioc_ram) is used for temporarily storing the column identifications of the graph data in the CSCI format.
The address calculation module (addr_cal) calculates the physical address PhyAddr in the memory of the data corresponding to the current active vertex i according to the vertex indexes input by the control circuit and the result generation unit, in combination with the per-column non-zero element counts provided by the column identification register and the amount of data RowSize stored per memory row. Assuming the base address of the CSCI-format graph data in the memory is BaseAddr, PhyAddr can be expressed as:
Formula one: PhyAddr = BaseAddr + (nnz_c0 + nnz_c1 + ... + nnz_ci) / RowSize.
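A sketch of formula one in software (ours; the text does not specify the rounding of the division, so floor division is an assumption here):

```python
def phys_addr(base_addr: int, nnz_per_col: list[int], i: int, row_size: int) -> int:
    """Formula one: PhyAddr = BaseAddr + (nnz_c0 + ... + nnz_ci) / RowSize.
    nnz_per_col holds the per-column non-zero counts copied from ioc_ram."""
    return base_addr + sum(nnz_per_col[: i + 1]) // row_size
```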
Referring to fig. 4, the scheduler (SCD) consists of a buffer allocation module (buf_assign), a task scheduling module (task_sch), and a double-buffer module (double_buffer).
The buffer allocation module (buf_assign) parses the column identification of a column of data sent by the data access unit from the CSCI-format graph data to obtain the number of data pairs to be processed for that column, i.e., the value of the column identification data pair; it sends the column of data to be processed to the double-buffer module (double_buffer) according to the buffer state information sent by the double-buffer module, and sends a stop-reading signal to the data access unit (DAU) when all buffers are occupied, i.e., full;
the task scheduling module (task_sch) sends unscheduled data (data not yet sent into the mixed granularity processing unit) in all buffers into a processing element that is idle and whose computing capacity meets the requirement, according to the processing element state signal sent by the mixed granularity processing unit (MGP) and the buffer state information sent by the double-buffer module (double_buffer);
the double-buffer module (double_buffer) consists of 16 sets of front and back double buffers with different capacities; the capacity of each buffer is listed below, and a software sketch of the allocation policy follows the list.
In the buffer names, f and b denote the front and back buffers respectively, 0-7 denote buffers 0 through 7, and 8-11 and 12-13 are analogous.
When the double-buffer module receives CSCI-format graph data, it determines which buffer to use according to the number of data pairs in the column of data. Each buffer is provided as a front buffer and a back buffer, used alternately in ping-pong fashion. Initially, the states of all buffers are "empty" and data is stored into the front buffers, i.e., buf_f; when all front buffers of a given capacity are occupied, newly received data within that capacity range is stored into the back buffers, i.e., buf_b; and if both the small-capacity front and back buffers are occupied while a larger-capacity buffer is idle, a small amount of data may be stored in the large-capacity buffer.
For example, when the number of data pairs does not exceed 64, the data is stored into buf0_f to buf7_f in turn: if buf0_f holds data that has not yet been read away, buf1_f is used; if buf1_f is occupied, buf2_f is used, and so on. When buf0_f to buf7_f are completely occupied, data is stored into buf0_b to buf7_b in turn, and if buf0_f to buf7_f and buf0_b to buf7_b are all occupied, data whose number of pairs does not exceed 64 may be temporarily stored in buf8_f to buf11_f, and so on.
Because the data bandwidth of the memory is limited, when a column of CSCI-format graph data exceeds the memory data bandwidth, the data must be stored into a buffer over several transfers until the whole column has been buffered; the buffer state is then "full" and the task scheduling module (task_sch) is notified that the column of data can be scheduled. After the data is scheduled, the buffer state becomes "empty" and is sent to the buffer allocation module (buf_assign). When the number of data pairs in a column of the graph data exceeds 1024, i.e., the maximum buffer capacity, the column is processed in batches.
buf0-7_f, buf0-7_b: 64 data pairs;
buf8-11_f, buf8-11_b: 128 data pairs;
buf12-13_f, buf12-13_b: 256 data pairs;
buf14_f, buf14_b: 512 data pairs;
buf15_f, buf15_b: 1024 data pairs.
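The capacity-matched ping-pong allocation above can be sketched as follows (a sketch of ours: the free-set representation and function name are illustrative, and the 64/128/256-pair capacities of buf0-7, buf8-11 and buf12-13 are inferred from the text and from the matching processing element capacities below):

```python
def pick_buffer(n_pairs: int, occupied: set[str]):
    """Return a free buffer name for a column of n_pairs data pairs, or None.
    Smallest sufficient capacity class first; within a class, all front (_f)
    buffers are tried before any back (_b) buffer, as described above. None
    means every candidate is occupied, so the scheduler would assert the
    stop-reading signal toward the data access unit."""
    classes = [(64, range(0, 8)), (128, range(8, 12)), (256, range(12, 14)),
               (512, range(14, 15)), (1024, range(15, 16))]
    for cap, ids in classes:
        if cap < n_pairs:
            continue                      # this class is too small for the column
        for half in ("_f", "_b"):         # ping-pong: front buffers, then back
            for i in ids:
                name = f"buf{i}{half}"
                if name not in occupied:
                    return name
    return None
```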
Referring to fig. 5, the mixed granularity processing unit (MGP) consists of an auxiliary circuit module (aux_cell) and a processing element array (PEA).
The auxiliary circuit module (aux_cell) sends the active vertex data pair input by the result generation unit (RGU), together with the corresponding CSCI-format graph data input by the scheduler (SCD), to a corresponding idle processing element in the processing element array according to the processing element states.
The above active vertex data pair can be understood as follows: the BFS/SSSP computation produces, for each vertex, a value representing the depth or distance of that vertex in the graph, and intermediate results take the same form, except that the value is updated during subsequent iterations. Hence the name active vertex data pair.
The processing element array (PEA) consists of 16 processing elements (PE) of different capacities that can work in parallel; their processing capacities are listed below (the data pairs counted here do not include column identification data pairs), and a software sketch of the per-element computation follows the list.
After receiving the active vertex data pair and the CSCI-format graph data input by the auxiliary circuit module (aux_cell), a processing element (PE) computes on them according to the application type sent by the control circuit (CTR).
When the application type is breadth-first search (BFS), the processing element (PE) takes the value of the active vertex data pair plus 1 as the updated value of each data pair of the CSCI graph data;
when the application type is single-source shortest path (SSSP), the processing element (PE) adds the value of the active vertex data pair to the value of each data pair of the CSCI graph data to update the value of each data pair of the CSCI-format graph data;
the calculation result, which contains at most as many data pairs as the number of pairs the processing element (PE) can process simultaneously, is output to the result generation unit (RGU).
PE0-7: 64 data pairs processed simultaneously;
PE8-11: 128 data pairs processed simultaneously;
PE12-13: 256 data pairs processed simultaneously;
PE14: 512 data pairs processed simultaneously;
PE15: 1024 data pairs processed simultaneously.
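A functional sketch of the per-element computation (ours; value widths, saturation, and the parallel lanes within a PE are not modeled):

```python
def pe_compute(app_type: str, active_value: int,
               column_pairs: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Compute candidate values for one column of non-zero element data pairs.
    BFS:  candidate depth    = value of the active vertex data pair + 1.
    SSSP: candidate distance = value of the active vertex data pair + edge weight."""
    if app_type == "BFS":
        return [(row, active_value + 1) for row, _ in column_pairs]
    if app_type == "SSSP":
        return [(row, active_value + w) for row, w in column_pairs]
    raise ValueError("unknown application type")
```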
Referring to fig. 6, the result generation unit (RGU) consists of an operation module (OPC), a comparator (CMP), and an on-chip result register (cur_rlt).
Referring to figs. 7 and 6, the operation module (OPC) consists of an 8-way 4-stage pipeline tree built from 15 operation units (op_cells).
The computing capacity of each op_cell is the same as the maximum number of data pairs contained in the input data;
each op _ cell calculates intermediate data input by (MGP) according to an application type input by a control Circuit (CTR), and the calculation of breadth-first search (BFS) and single-source shortest path (SSSP) is a comparison operation, namely, values of two paths of input data pairs with the same row index are compared, and a smaller value is output to the next level as a new value corresponding to the row index until the new value is output from the last level op _ cell and then is input to a Comparator (CMP);
a Comparator (CMP) reads a last value corresponding to the row index from an on-chip result register (cur _ rlt) according to the row index of each data pair one by one according to data input by an operation module (OPC), compares the last value with an input current value, does not perform any operation if the current value is not smaller than the last value, directly calculates a next row index, updates the value of the row index in the on-chip result register (cur _ rlt) if the current value is smaller, sets a vertex corresponding to the row index as an active vertex, outputs the row index to a Data Access Unit (DAU), and outputs the row index and the value data pair to a mixed granularity processing unit (MGP); the on-chip result register (cur _ rlt) is used to register the depth (depth) (BFS for breadth first search) and distance (distance) (SSSP for single source shortest path).
The invention has already been adopted in the 'mixed-granularity parallel graph computation accelerator' project, where practical verification shows that the circuit functions as expected and works reliably, achieving the aim of the invention.
It should be understood that the above description of specific embodiments of the present invention is only intended to illustrate the technical ideas and features of the present invention and to enable those skilled in the art to understand and implement it; the present invention is not limited to the above specific embodiments. All changes and modifications within the scope of this invention are intended to be covered by the appended claims.

Claims (8)

1. A graph computation accelerator comprising preprocessing circuitry and memory;
the preprocessing circuit carries out conversion processing on the sparse adjacency matrix data according to a graph data compression method of a graph computation accelerator;
specifically, the preprocessing circuit of the graph computation accelerator converts the graph data to be processed, represented by a sparse adjacency matrix, into graph data in the independent sparse column compression CSCI format, where each independently compressed column of the graph data comprises a column identification data pair and non-zero element data pairs, each data pair comprises an index and a value, and the meanings of the remaining bits of the index and of the value are indicated by the highest two bits of the index; the preprocessing circuit of the graph computation accelerator stores the converted CSCI-format graph data in a memory of the graph computation accelerator;
and, the graph computation accelerator further comprises:
the device comprises a control circuit, a data access unit, a scheduler, a mixed granularity processing unit and a result generation unit;
the preprocessing circuit is further used for storing the column identification copy in the CSCI into the memory;
the control circuit is used for receiving a conversion-ready indication signal sent by the preprocessing circuit after the storage into the memory is finished, controlling the operation of the data access unit, the mixed granularity processing unit and the result generation unit according to the graph computation application type sent by the host, and sending the root vertex index of application type I or the source vertex index of application type II sent by the host to the data access unit;
the data access unit is used for reading the graph data and the column identification of the CSCI from the memory, calculating the physical address of the appointed vertex in the memory according to the root vertex index, the source vertex index or the active vertex index transmitted by the result generation unit so as to access the data, and transmitting the read data to the scheduler;
the scheduler is used for temporarily storing the number of non-zero element data pairs indicated by the column identifications in the CSCI and distributing the temporarily stored data to the processing elements in the mixed granularity processing unit for processing according to the state signals of the processing elements in the mixed granularity processing unit;
the mixed granularity processing unit is used for carrying out parallel processing on the data temporarily stored in the scheduler according to the application type in the control circuit and the active vertex data of the result generating unit and transmitting the processed intermediate data to the result generating unit;
and the result generating unit is used for processing the intermediate data according to the application type in the control circuit, sending the active vertex index of the processing process to the data access unit and storing the processed final result.
2. The graph computation accelerator of claim 1, wherein the control circuit comprises: a host interface component and a control logic component;
the host interface component is used for receiving an application type, a root vertex index of the application type I and a source vertex index of the application type II which are sent by a host;
the control logic component is used for receiving a conversion ready indication signal sent by the preprocessing circuit, sending the root vertex index or the source vertex index to the data access unit, sending the application type to the mixed granularity processing unit and the result generating unit, and starting each module in the graph computation accelerator to start working;
application type I is the breadth-first search application BFS type, and application type II is the single-source shortest path application SSSP type.
3. The graph computation accelerator of claim 2, wherein the data access unit comprises: a user logic component, an address calculation module and a column identification register;
the column identification register is used for storing the column identifications of the graph data in the CSCI;
the address calculation module is used for calculating the physical address in the memory of the data corresponding to the current active vertex i, according to the vertex index sent by the control circuit or input by the result generation unit, in combination with the per-column non-zero element counts in the column identification register and the amount of data stored per memory row;
the user logic component is used for reading the column identifications from the memory and temporarily storing them in the column identification register; reading the data corresponding to the corresponding active vertex from the memory according to the address calculated by the address calculation module, and sending the read data to the scheduler;
after receiving the pause-reading signal sent by the scheduler, stopping reading data from the memory;
the user logic component is also used for resuming reading after the pause-reading signal sent by the scheduler is deasserted.
4. The graph computation accelerator of claim 3, wherein the scheduler comprises: the system comprises a buffer area distribution module, a task scheduling module and a double-buffer area module;
the buffer area distribution module is used for analyzing a data pair corresponding to the column identifier of the column data transmitted from the data access unit, transmitting the graph data corresponding to the column identifier to the double-buffer area module according to the buffer area state information transmitted by the double-buffer area module, and transmitting a reading stop signal to the data access unit when all the buffer areas in the double-buffer area module are occupied;
the task scheduling module is used for sending unscheduled data in all the buffers into a processing element which is idle and has the calculation capacity meeting the requirement for processing according to the processing element state signal transmitted by the mixed granularity processing unit and the buffer state information transmitted by the double-buffer module;
the double-buffer module comprises: a plurality of groups of front and back double buffer areas with different temporary storage capacities;
the double-buffer module is used for informing the task scheduling module to schedule the graph data cached by the buffer when the states of all the buffers are full, setting the state of the buffer after the data scheduling to be empty, and sending the state of the buffer to the buffer allocation module.
5. The graph computation accelerator of claim 4, wherein the mixed granularity processing unit comprises: an auxiliary circuit module and a processing element array;
the auxiliary circuit module is used for sending the active vertex data pair input by the result generation unit, together with the corresponding CSCI input by the scheduler, to a corresponding idle processing element in the processing element array according to the state of each processing element in the processing element array;
the processing element array consists of a plurality of processing elements PE with different capacities, and the plurality of processing elements work in parallel;
and after each processing element receives the active vertex data pair and the CSCI input by the auxiliary circuit module, calculating the active vertex data pair and the CSCI according to the application type transmitted by the control circuit.
6. The graph computation accelerator of claim 1, wherein the preprocessing circuitry of the graph computation accelerator is specifically configured to:
independently compressing the graph data represented by the sparse adjacency matrix into individual data pairs according to columns;
each data pair structure includes: index and value;
the data pair with the highest two index bits of "01" or "10" is the column identifier ioc;
the data pairs following each column identification data pair are the data pairs corresponding to the non-zero elements of all rows of that column.
7. The graph computation accelerator of claim 6,
wherein when the highest two bits of the index are "01", the remaining bits of the index represent a column index, and value represents the number of non-zero elements of that column in the sparse adjacency matrix;
when the highest two bits of the index are "10", the remaining bits of the index represent a column index, the column is the last column of the sparse adjacency matrix, and value represents the number of non-zero elements of that column in the sparse adjacency matrix;
when the highest two bits of the index are "00", the remaining bits of the index represent a row index, and value represents the corresponding non-zero element value in the sparse adjacency matrix.
8. The graph computation accelerator of claim 6, wherein the numbers of bits of index and value are determined according to the scale of the sparse adjacency matrix data.
CN201910107925.9A 2019-02-02 2019-02-02 Graph data compression method for graph computation accelerator and graph computation accelerator Active CN109919826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910107925.9A CN109919826B (en) 2019-02-02 2019-02-02 Graph data compression method for graph computation accelerator and graph computation accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910107925.9A CN109919826B (en) 2019-02-02 2019-02-02 Graph data compression method for graph computation accelerator and graph computation accelerator

Publications (2)

Publication Number Publication Date
CN109919826A CN109919826A (en) 2019-06-21
CN109919826B true CN109919826B (en) 2023-02-17

Family

ID=66961445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910107925.9A Active CN109919826B (en) 2019-02-02 2019-02-02 Graph data compression method for graph computation accelerator and graph computation accelerator

Country Status (1)

Country Link
CN (1) CN109919826B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598175B (en) * 2019-09-17 2021-01-01 西安邮电大学 Sparse matrix column vector comparison device based on graph computation accelerator
CN111309976B (en) * 2020-02-24 2021-06-25 北京工业大学 GraphX data caching method for convergence graph application
CN113326125B (en) * 2021-05-20 2023-03-24 清华大学 Large-scale distributed graph calculation end-to-end acceleration method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636273A (en) * 2015-02-28 2015-05-20 中国科学技术大学 Storage method of sparse matrix on SIMD multi-core processor with multi-level cache
CN106951961A (en) * 2017-02-24 2017-07-14 清华大学 The convolutional neural networks accelerator and system of a kind of coarseness restructural
CN107155358A (en) * 2012-08-02 2017-09-12 希捷科技有限公司 Combination grain higher level redundancy for nonvolatile memory
CN107229967A (en) * 2016-08-22 2017-10-03 北京深鉴智能科技有限公司 A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
CN107704916A (en) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA
EP3343392A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Hardware accelerator architecture and template for web-scale k-means clustering
EP3343391A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Heterogeneous hardware accelerator architecture for processing sparse matrix data with skewed non-zero distributions

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336758B (en) * 2013-06-29 2016-06-01 中国科学院软件研究所 The sparse matrix storage means of a kind of employing with the sparse row of compression of local information and the SpMV implementation method based on the method
CN106157339B (en) * 2016-07-05 2019-06-18 华南理工大学 The animated Mesh sequence compaction method extracted based on low-rank vertex trajectories subspace
EP3333720A1 (en) * 2016-12-12 2018-06-13 Thomson Licensing A method and an apparatus for encoding a signal transporting data for reconstructing a sparse matrix

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107155358A (en) * 2012-08-02 2017-09-12 希捷科技有限公司 Combination grain higher level redundancy for nonvolatile memory
CN104636273A (en) * 2015-02-28 2015-05-20 中国科学技术大学 Storage method of sparse matrix on SIMD multi-core processor with multi-level cache
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
CN107704916A (en) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA
CN107229967A (en) * 2016-08-22 2017-10-03 北京深鉴智能科技有限公司 A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
EP3343392A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Hardware accelerator architecture and template for web-scale k-means clustering
EP3343391A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Heterogeneous hardware accelerator architecture for processing sparse matrix data with skewed non-zero distributions
CN106951961A (en) * 2017-02-24 2017-07-14 清华大学 The convolutional neural networks accelerator and system of a kind of coarseness restructural

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Large-scale structural computation based on supernodal LDL decomposition; Lai Zhichao; Computer Aided Engineering; April 2014; Vol. 23, No. 2; 46-52 *
Research on storage methods for large sparse matrices in engineering computation; Ji Guoliang; Journal on Numerical Methods and Computer Applications; 2018-09-14; Vol. 39, No. 3; 217-230 *

Also Published As

Publication number Publication date
CN109919826A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109919826B (en) Graph data compression method for graph computation accelerator and graph computation accelerator
WO2018149345A1 (en) Data processing method and device
Ozfatura et al. Gradient coding with clustering and multi-message communication
CN112200300B (en) Convolutional neural network operation method and device
CN105808328A (en) Task scheduling method, device and system
CN108334942A (en) Data processing method, device, chip and the storage medium of neural network
CN109656798B (en) Vertex reordering-based big data processing capability test method for supercomputer
CN112925637A (en) Load balancing device and method for edge operation network
CN103020024A (en) File format converting method
CN109949202B (en) Parallel graph computation accelerator structure
CN109598250A (en) Feature extracting method, device, electronic equipment and computer-readable medium
CN109086879B (en) Method for realizing dense connection neural network based on FPGA
CN105488176A (en) Data processing method and device
CN114676522B (en) Pneumatic shape optimization design method, system and equipment integrating GAN and migration learning
CN102722410B (en) The method of executive routine, server, mobile terminal and system
EP4058943A1 (en) Threshold triggered back propagation of an artificial neural network
US9116751B2 (en) Reconfigurable device, processing assignment method, processing arrangement method, information processing apparatus, and control method therefor
CN116610731B (en) Big data distributed storage method and device, electronic equipment and storage medium
CN108647780A (en) Restructural pond operation module structure towards neural network and its implementation
CN109635034A (en) Training data method for resampling, device, storage medium and electronic equipment
CN117061365A (en) Node selection method, device, equipment and readable storage medium
CN108733739B (en) Operation device and method supporting cluster searching
WO2022188103A1 (en) Data acquisition method and apparatus, calculation device, and storage medium
CN111260070B (en) Operation method, device and related product
CN111258641B (en) Operation method, device and related product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20190618

Address after: West Chang'an Street, Chang'an District, Xi'an City, Shaanxi Province

Applicant after: XI'AN University OF POSTS & TELECOMMUNICATIONS

Applicant after: BOARD OF REGENTS THE University OF TEXAS SYSTEM

Address before: 710121 West Chang'an Street, Chang'an District, Xi'an City, Shaanxi Province

Applicant before: Xi'an University of Posts & Telecommunications

Applicant before: University of Texas at Austin

TA01 Transfer of patent application right
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CI02 Correction of invention patent application

Correction item: Applicant|Address|Applicant

Correct: Xi'an University of Posts and Telecommunications: 710121 West Chang'an Street, Chang'an District, Xi'an City, Shaanxi Province|University of Texas at Austin

False: Xi'an University of Posts & Telecommunications|710121 Xi'an, Shaanxi, Changan District West Chang'an Avenue|Univ. Texas

Number: 27-02

Volume: 35

CI02 Correction of invention patent application
GR01 Patent grant
GR01 Patent grant