CN111652346A - Large-scale graph deep learning computation framework based on hierarchical optimization paradigm - Google Patents
Large-scale graph deep learning computation framework based on a hierarchical optimization paradigm
- Publication number
- CN111652346A CN111652346A CN202010318058.6A CN202010318058A CN111652346A CN 111652346 A CN111652346 A CN 111652346A CN 202010318058 A CN202010318058 A CN 202010318058A CN 111652346 A CN111652346 A CN 111652346A
- Authority
- CN
- China
- Prior art keywords
- vertex
- data
- gpu
- graph
- edge
- Prior art date: 2020-04-21
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, and relates to the technical field of deep learning. The framework comprises block-based dataflow translation, which divides the vertex and edge data of a graph into small chunks; dataflow graph optimization, which generates the scheduling strategy; propagation kernels on the GPU, which perform efficient graph propagation operations; and multi-GPU parallel processing, which parallelizes computation across the multiple GPUs in one server. The invention supports large-scale graph neural networks (GNNs), expresses models simply, and provides a scalable and efficient GPU parallel-processing engine. To represent the recursive computation of each GNN layer, which combines graph propagation with deep neural network (DNN) computation, the invention adopts a Separation-UDF_Edge-Aggregation-UDF_Vertex with Neural Networks (SUAU-NN) vertex program abstraction.
Description
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm.
Background
Deep learning in the form of deep neural networks has been successfully applied to many fields, such as speech, vision, and natural language processing. In these fields, the underlying data are typically represented in a regular grid structure, which facilitates hardware acceleration (e.g., on GPUs) over large amounts of parallel data. Driven by the importance of graph data in social networks, knowledge graphs, bioinformatics, and neuroscience, applying deep learning models to data with irregular graph structures is an emerging trend, yielding state-of-the-art prediction results in target applications such as classification, embedding, and question-answering systems. These graph-based neural networks typically apply a neural network model to the features associated with the vertices and edges of the graph, then propagate and aggregate the results to produce the features of the next layer.
However, no existing solution supports GNNs well. Existing graph processing engines typically provide a Gather-Apply-Scatter (GAS) vertex program model, which cannot express or support the neural network computations within a graph structure. Deep learning frameworks such as TensorFlow, PyTorch, MXNet, and CNTK express neural networks as dataflow graphs, but do not directly support the graph propagation model. Furthermore, none of them provides the scalability required to process large-scale graphs, nor do they support efficient GPU-based implementations of graph propagation operators. This lack of support severely limits the ability to exploit the full potential of large-scale GNNs, and also poses a significant challenge to integrating DNNs with large graph structures at the system level.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, which supports large-scale GNNs, expresses models simply, and provides a scalable and efficient GPU parallel-processing engine. The invention combines the dataflow and vertex program abstractions into a new model, in which neural network computations on vertex or edge data (handled in tensor form) are expressed through the dataflow abstraction, rather than through abstractions designed for traditional graph processing (e.g., PageRank, connected components, or shortest-path algorithms). To represent the recursive computation of each GNN layer, which combines graph propagation with DNN computation, the invention employs the SUAU-NN vertex program abstraction.
The invention introduces special operators supported by the graph propagation engine into the dataflow graph and optimizes their execution on the GPU. The scheme therefore exploits the parallelism within each vertex data access to improve memory access efficiency.
The technical scheme adopted by the invention is as follows:
A large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, implemented by extending TensorFlow with custom operators for the vertex program abstraction and the graph propagation process. Compared with GPU-accelerated TensorFlow on small graphs, it is about four times faster. It comprises block-based dataflow translation, dataflow graph optimization, propagation kernels on the GPU, and multi-GPU parallel processing, wherein:
the block-based dataflow translation divides the vertex and edge data of a graph into small chunks that fit in GPU device memory, and builds a dataflow graph from operators that process computation at chunk granularity (see the sketch after this list);
the dataflow graph optimization generates the scheduling strategy;
the propagation kernels on the GPU perform efficient graph propagation operations;
the multi-GPU parallel processing parallelizes computation across the multiple GPUs in one server.
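By way of illustration, the following minimal sketch partitions a toy graph's vertices into fixed-size chunks and buckets each edge into a (source-chunk, target-chunk) tile, so that processing one tile needs at most two vertex chunks resident in GPU memory. The 2-D tile layout, the function name, and the toy data are assumptions for illustration; the text above only requires that the chunks fit in GPU device memory.

```python
def partition_graph(num_vertices, edges, chunk_size):
    """Bucket each edge into the (source-chunk, target-chunk) tile it
    falls into. Hypothetical helper sketching one plausible chunking."""
    tiles = {}
    for src, dst in edges:
        key = (src // chunk_size, dst // chunk_size)
        tiles.setdefault(key, []).append((src, dst))
    num_chunks = (num_vertices + chunk_size - 1) // chunk_size
    return num_chunks, tiles

# Toy graph: 6 vertices split into chunks {0,1,2} and {3,4,5}.
num_chunks, tiles = partition_graph(
    6, [(0, 1), (0, 4), (2, 5), (3, 1), (4, 5)], chunk_size=3)
print(num_chunks)     # 2
print(sorted(tiles))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```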
Preferably, the block-based dataflow translation further comprises a front end that transforms an algorithm written in the SUAU-NN model into a chunk-granularity dataflow graph, enabling GNN computation over large graphs on a GPU. SUAU-NN defines four feedforward computation stages at each layer of the GNN: Separation, UDF_Edge, Aggregation, and UDF_Vertex. SUAU-NN provides two user-defined functions (UDFs), UDF_Edge and UDF_Vertex, with which users declare the neural network computations on edges and vertices respectively. The UDF_Edge function defines the computation on each edge; it takes as input edge, an abstraction of the edge data, and p, which contains the learnable parameters of the GNN model. Each edge is a tuple of tensors [src, dest, data] representing the data of the source and target vertices connected by the edge, as well as the data associated with the edge itself (e.g., edge weights). The function may apply a neural network model to edge and p, and outputs an intermediate tensor value associated with the edge. The UDF_Vertex function defines the computation on a vertex; it takes as input the tensor data vertex, a vertex aggregation accumulator accum, and the learnable parameters p, and returns new vertex data by applying a neural network model. The SUAU-NN abstraction is built on top of the dataflow framework, so users can symbolically define dataflow graphs inside UDFs by connecting the mathematical operations (e.g., add, tanh, softmax, matmul) provided by the underlying framework. The other two stages, Separation and Aggregation, perform data propagation and prepare the inputs for UDF_Edge and UDF_Vertex; they are triggered and executed implicitly by the system, without requiring the user to provide explicit UDFs.
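For concreteness, here is a minimal NumPy sketch of what such a pair of UDFs might look like. It is not code from the invention: the gating computation, the parameter names (W_gate, W_self, W_neigh), and the use of NumPy arrays in place of the underlying framework's symbolic tensors are illustrative assumptions.

```python
import numpy as np

def udf_edge(edge, p):
    """UDF_Edge: computation on one edge. `edge` is the tensor tuple
    [src, dest, data]; `p` holds learnable parameters. A hypothetical
    gated message: sigmoid(src @ W_gate) * src."""
    src, dest, data = edge
    gate = 1.0 / (1.0 + np.exp(-(src @ p["W_gate"])))
    return gate * src  # intermediate tensor value associated with the edge

def udf_vertex(vertex, accum, p):
    """UDF_Vertex: computation on one vertex. `accum` is the aggregated
    result of the incoming edges' UDF_Edge outputs."""
    return np.tanh(vertex @ p["W_self"] + accum @ p["W_neigh"])
```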
Preferably, the dataflow graph optimization further comprises minimizing data exchange between host and GPU device memory, maximizing reuse of data chunks already in GPU memory while overlapping data movement and computation in a streaming manner, and identifying opportunities for operator fusion and for removal of redundant computations.
Preferably, the propagation kernels on the GPU further comprise support for stream-based processing that overlaps data movement and computation in the GPU. They include a scatter (Separation) kernel and an Aggregation kernel. The Separation operator scatters vertex feature data from the source and target vertices to the edges. The Aggregation operator collects the edge feature vectors of each target vertex through a user-provided accumulation function and reduces them into a single vector per target vertex.
Preferably, the multi-GPU parallel processing further comprises connecting to the CPU/DRAM (host memory) through a multi-level PCI Express (PCIe) interface hierarchy to implement the multi-GPU interconnect, and avoiding redundant data movement from host memory by exchanging data chunks directly between GPUs using a ring-based streaming mechanism.
With the above technical scheme, the invention has the following advantages over the prior art:
First, it supports large-scale graph-based neural network model computation. Second, it supports many different types of graph-based neural networks. Third, it provides a new program abstraction that is mapped and optimized into a dataflow graph and executed efficiently on the GPU. Fourth, it supports efficient and scalable parallel training of graph neural networks.
Drawings
FIG. 1(a) shows the four-stage execution flow of the feedforward computation of the SUAU-NN model; FIG. 1(b) shows the layer-wise optimized dataflow graph of G-GCN; FIG. 2 is a diagram of the multi-GPU architecture.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
Referring to FIGS. 1 and 2, the invention discloses a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, comprising block-based dataflow translation, dataflow graph optimization, propagation kernels on the GPU, and multi-GPU parallel processing. The large-scale graph deep learning is implemented as follows:
1) Block-based dataflow translation: to achieve scalability beyond the physical limitations of a GPU, the framework divides the graph data, with its attached vertex and edge data, into chunks, and transforms the GNN algorithm expressed in the model into a dataflow graph with chunk-granularity operators, which allows chunk-based parallel stream processing on a single GPU or multiple GPUs. First, the Separation stage transfers the tensor data vertex of each vertex onto the adjacent edges to form the edge data, including the source and target vertex data. The subsequent UDF_Edge stage invokes the parallel computation defined by UDF_Edge on the edge data, producing an intermediate tensor value for each edge as output. The Aggregation stage then propagates these outputs along the edges and aggregates them at the target vertices through commutative and associative accumulation operations. Finally, the UDF_Vertex stage executes the computation defined in the UDF_Vertex function on all vertices, producing the updated vertex data for the next layer. For each GNN layer, the four-stage execution flow of the feedforward computation of the SUAU-NN model is shown in FIG. 1(a).
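The four stages can be summarized by the following dense reference sketch. It conveys the semantics only, with plain Python dictionaries standing in for the chunked GPU dataflow; it reuses the hypothetical `udf_edge`/`udf_vertex` sketches above, and the default sum accumulator is an assumption.

```python
import numpy as np

def suau_nn_layer(vertex_data, edges, p, accumulate=np.add):
    """One GNN layer as the four SUAU-NN stages (semantics-only reference;
    the real system runs this as a chunk-granularity dataflow graph)."""
    # Separation: move vertex tensors onto the adjacent edges.
    edge_inputs = [(vertex_data[s], vertex_data[d], None) for s, d in edges]
    # UDF_Edge: user-defined computation on every edge.
    messages = [udf_edge(e, p) for e in edge_inputs]
    # Aggregation: commutative/associative accumulation at target vertices
    # (assumes messages share the vertex feature shape, as in the sketch above).
    accum = {v: np.zeros_like(x) for v, x in vertex_data.items()}
    for (s, d), m in zip(edges, messages):
        accum[d] = accumulate(accum[d], m)
    # UDF_Vertex: user-defined per-vertex computation -> next-layer data.
    return {v: udf_vertex(x, accum[v], p) for v, x in vertex_data.items()}
```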
2) Dataflow graph optimization: since a vertex may have multiple adjacent edges, its data may be scattered onto all of those edges, so performing the same multiplication on a vertex once per edge is redundant. The optimization therefore moves computations that involve only the source or target vertex from the UDF_Edge stage of the current layer into the UDF_Vertex stage of the previous layer. FIG. 1(b) shows the optimized dataflow graph, with the matrix multiplication moved into the UDF_Vertex stage.
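The effect of this rewrite can be seen in a small runnable fragment; the toy data, W, and the surrounding names are hypothetical, and the point is only that the per-edge matmul collapses into a per-vertex one.

```python
import numpy as np

vertex_data = {v: np.random.rand(4) for v in range(3)}
edges = [(0, 1), (0, 2), (1, 2)]
W = np.random.rand(4, 4)

# Before: W is applied on every edge, so vertex 0 (two outgoing edges)
# pays for two identical multiplications of its own data.
messages = [vertex_data[s] @ W for s, d in edges]

# After: W is applied once per vertex in the previous layer's UDF_Vertex
# stage; the Separation stage then only copies precomputed results.
h = {v: x @ W for v, x in vertex_data.items()}  # one matmul per vertex
messages = [h[s] for s, d in edges]             # pure data movement
```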
3) Propagation kernels on the GPU: Separation and Aggregation are the two key stages that handle the propagation process over the sparse graph structure. To support these two stages efficiently on the GPU, customized Separation and Aggregation kernels perform optimized propagation. The design lays out the data structures so that the kernels can better exploit the GPU memory hierarchy (e.g., shared memory and registers) and massive parallelism. The Separation operator scatters vertex feature data from the source and target vertices to the edges. In the feedforward computation, the edges are arranged in CSC (compressed sparse column) format, so the incoming edges of a single target vertex are organized into one group, and a thread block is assigned to process each group. For vertices with higher degrees, the edge group is partitioned into contiguous subgroups, each assigned to a thread block. Within each thread block, the Separation of the source vertex data proceeds in two phases: first, the threads fetch the source vertex ids from the edges of the edge group, which are stored in adjacent address space, and cache them in shared memory; second, the threads copy the vertex data into the edge data memory in parallel along the dimensions of the vertex feature vector, giving the threads efficient coalesced memory access. The Aggregation operator collects the edge feature vectors of each target vertex through a user-provided accumulation function and reduces them into a single vector per target vertex: each thread block cooperatively enumerates the edges of its group, accumulates the features of each edge into a temporary vector held in registers, and then writes the result back to the corresponding target vertex.
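A sequential CPU sketch of the two kernels' combined semantics, assuming the CSC layout described above, may help; a Python loop stands in for one GPU thread block per edge group, and the function name and signature are illustrative, not the invention's API.

```python
import numpy as np

def propagate(col_ptr, row_idx, src_feat, accumulate=np.add):
    """Separation + Aggregation over a CSC graph: the incoming edges of
    target vertex v are row_idx[col_ptr[v]:col_ptr[v+1]] (source ids stored
    contiguously, matching the grouped layout the GPU kernels rely on)."""
    num_targets = len(col_ptr) - 1
    out = np.zeros((num_targets, src_feat.shape[1]), dtype=src_feat.dtype)
    for v in range(num_targets):                    # one thread block per group
        group = row_idx[col_ptr[v]:col_ptr[v + 1]]  # phase 1: fetch source ids
        for s in group:                             # phase 2: copy + accumulate
            # on the GPU the running total lives in a register vector
            out[v] = accumulate(out[v], src_feat[s])
    return out

# Toy CSC graph: vertex 0 receives from {1, 2}, vertex 1 receives from {0}.
col_ptr = np.array([0, 2, 3])
row_idx = np.array([1, 2, 0])
print(propagate(col_ptr, row_idx, np.eye(3)))  # rows = sums of in-neighbor features
```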
4) Multi-GPU parallel processing: in a multi-GPU system, the GPU interconnect has a hierarchical structure, which creates multiple levels of locality when data are transferred between GPUs or between host and GPU device memory. As shown in FIG. 2, GPU 0 and GPU 1 are attached to the same PCIe switch and therefore have the highest communication performance. Taking the locality characteristics of a multi-GPU system into account, a streaming scheme is carefully designed to achieve higher parallelism and better performance. At the same time, a ring-based streaming scheme allows the GPUs to reuse data chunks already loaded from host memory by exchanging them directly. The data-sharing paths between the GPUs are organized into a ring, shown by the dashed lines in FIG. 2.
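A minimal scheduling sketch of the ring idea follows; the round structure and names are assumptions, and real transfers would be asynchronous GPU-to-GPU copies overlapped with compute.

```python
def ring_stream(num_gpus, chunks, process):
    """Each GPU loads one chunk from host memory once; the chunks then
    circulate around the ring so every GPU sees every chunk without
    re-reading host memory. `process(gpu, chunk)` is a hypothetical
    per-round compute callback; assumes len(chunks) == num_gpus."""
    held = list(range(num_gpus))  # index of the chunk resident on each GPU
    for _round in range(num_gpus):
        for g in range(num_gpus):
            process(g, chunks[held[g]])
        # each GPU hands its chunk to its successor via a direct GPU-GPU copy
        held = [held[(g - 1) % num_gpus] for g in range(num_gpus)]

# Example: 3 GPUs, 3 chunks -- every (gpu, chunk) pair occurs exactly once.
ring_stream(3, ["c0", "c1", "c2"], lambda g, c: print(f"GPU {g} processes {c}"))
```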
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, comprising block-based dataflow translation, dataflow graph optimization, propagation kernels on the GPU, and multi-GPU parallel processing, wherein:
the block-based dataflow translation divides the vertex and edge data of a graph into small chunks that fit in GPU device memory, and builds a dataflow graph from operators that process computation at chunk granularity;
the dataflow graph optimization generates the scheduling strategy;
the propagation kernels on the GPU perform efficient graph propagation operations;
the multi-GPU parallel processing parallelizes computation across the multiple GPUs in one server.
2. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the block-based dataflow translation further comprises a front end that transforms an algorithm written in the SUAU-NN model into a chunk-granularity dataflow graph, making GNN computation over large graphs possible on the GPU; SUAU-NN defines four feedforward computation stages at each layer of the GNN, namely Separation, UDF_Edge, Aggregation, and UDF_Vertex; SUAU-NN provides two user-defined functions (UDFs), UDF_Edge and UDF_Vertex, with which the computations on the edges and vertices of the neural network are declared.
3. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the dataflow graph optimization further comprises minimizing data exchange between host and GPU device memory, maximizing reuse of data chunks in GPU memory while overlapping data movement and computation in a streaming manner, and identifying opportunities for operator fusion and elimination of redundant computations.
4. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the propagation kernels on the GPU further comprise support for stream-based processing that overlaps data movement and computation in the GPU;
they include a scatter (Separation) kernel and an Aggregation kernel;
the Separation operator scatters vertex feature data from the source and target vertices to the edges, and the Aggregation operator collects the edge feature vectors of each target vertex through a user-provided accumulation function and reduces them into a single vector per target vertex.
5. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the multi-GPU parallel processing further comprises connecting to the CPU/DRAM (host memory) through a multi-level PCI Express (PCIe) interface hierarchy to implement the multi-GPU interconnect, and using a ring-based streaming mechanism to avoid redundant data movement from host memory by exchanging data chunks directly between GPUs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010318058.6A CN111652346A (en) | 2020-04-21 | 2020-04-21 | Large-scale graph deep learning computation framework based on hierarchical optimization paradigm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010318058.6A CN111652346A (en) | 2020-04-21 | 2020-04-21 | Large-scale graph deep learning computation framework based on hierarchical optimization paradigm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111652346A | 2020-09-11 |
Family
ID=72352514
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010318058.6A | Large-scale graph deep learning computation framework based on hierarchical optimization paradigm | 2020-04-21 | 2020-04-21 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111652346A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190108639A1 (en) * | 2017-10-09 | 2019-04-11 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and Methods for Semantic Segmentation of 3D Point Clouds |
CN110941494A (en) * | 2019-12-02 | 2020-03-31 | 哈尔滨工程大学 | Deep learning-oriented GPU parallel computing data processing method |
Non-Patent Citations (2)
Title |
---|
LINGXIAO MA et al.: "Towards Efficient Large-Scale Graph Neural Network Computing" * |
QI Jinshan; LIANG Xun; LI Zhiyu; CHEN Yanfang; XU Yuan: "Large-scale complex information network representation learning: concepts, methods and challenges" * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022057310A1 (en) * | 2020-09-15 | 2022-03-24 | 华为技术有限公司 | Method, apparatus and system for training graph neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200911 |