
CN111652346A - Large-scale graph deep learning computation framework based on a hierarchical optimization paradigm - Google Patents

Large-scale graph deep learning computation framework based on a hierarchical optimization paradigm

Info

Publication number
CN111652346A
CN111652346A (application CN202010318058.6A)
Authority
CN
China
Prior art keywords
vertex
data
gpu
graph
edge
Prior art date
Legal status
Pending
Application number
CN202010318058.6A
Other languages
Chinese (zh)
Inventor
王彬 (Wang Bin)
洪万福 (Hong Wanfu)
钱智毅 (Qian Zhiyi)
Current Assignee
Xiamen Yuanting Information Technology Co., Ltd.
Original Assignee
Xiamen Yuanting Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Xiamen Yuanting Information Technology Co., Ltd.
Priority to CN202010318058.6A
Publication of CN111652346A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Generation (AREA)

Abstract

The invention provides a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, and relates to the technical field of deep learning. The framework comprises: block-based dataflow transformation, which partitions the vertex and edge data of a graph into chunks; dataflow-graph optimization, which generates a scheduling strategy; propagation kernels on the GPU, which perform efficient propagation operations; and multi-GPU parallel processing, which performs parallel computation across the GPUs of one server. The invention supports large-scale graph neural networks (GNNs), allows models to be expressed simply, and provides a scalable, efficient multi-GPU parallel processing engine. To represent the recursive computation of each GNN layer, which includes both graph propagation and deep neural network (DNN) computation, the invention adopts the Separation-UDF_Edge-Aggregation-UDF_Vertex with Neural Networks (SUAU-NN) vertex-program abstraction.

Description

Large-scale graph deep learning computation framework based on a hierarchical optimization paradigm
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm.
Background
Deep learning in the form of deep neural networks has been successfully applied in many fields, such as speech, vision, and natural language processing. In these fields, the underlying data typically have a regular grid structure, which lends itself to hardware acceleration (e.g., on GPUs) over large amounts of data in parallel. Driven by the importance of graph data in social networks, knowledge graphs, bioinformatics, and neuroscience, applying deep learning models to data with irregular graph structure is an emerging trend, with state-of-the-art prediction results in target applications such as classification, embedding, and question-answering systems. These graph-based neural networks typically apply a neural network model to the features associated with the vertices and edges of the graph, then propagate and aggregate the results to produce the next layer's features.
However, no existing solution supports GNNs well. Existing graph processing engines typically provide a Gather-Apply-Scatter (GAS) vertex-program model, which cannot express or support the neural network architectures used on graph structures. Deep learning frameworks such as TensorFlow, PyTorch, MXNet, and CNTK express neural networks as dataflow graphs, but do not directly support a graph propagation model. Furthermore, none of them provide the scalability required to process large-scale graphs, nor do they support efficient GPU-based implementations of graph propagation operators. This lack of support severely limits the ability to exploit the full potential of large-scale GNNs, and poses a significant challenge for integrating DNNs with large graph structures at the system level.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, which supports large-scale GNNs, allows models to be expressed simply, and provides a scalable and efficient multi-GPU parallel processing engine. The invention combines the dataflow and vertex-program abstractions into a new model, in which neural network computations over vertex or edge data (handled in tensor form) are expressed through the dataflow abstraction, rather than through abstractions designed for traditional graph processing (e.g., PageRank, connected components, or shortest-path algorithms). To represent the recursive computation of each GNN layer, which includes graph propagation and DNN computation, the invention adopts the SUAU-NN vertex-program abstraction.
The present invention introduces special operators backed by a graph propagation engine into the dataflow graph and optimizes their execution on the GPU. The scheme therefore improves memory-access efficiency by exploiting the parallelism within each vertex's data accesses.
The technical scheme adopted by the invention is as follows:
a large-scale graph deep learning calculation framework based on a hierarchical optimization paradigm is realized by using a custom operator extension TensorFlow of a vertex program abstraction and a graph propagation process. Compared with TensorFlow supported by a GPU on a small graph, the speed of the method is improved by about 4 times. It includes: block-based data stream conversion, optimization of a data stream graph, propagation of kernels on GPUs, and parallel processing of multiple GPUs, wherein:
the block-based dataflow transformation divides the vertex and edge data of a graph into chunks small enough to be loaded into GPU device memory, and constructs a dataflow graph together with operators that process computation at chunk granularity;
the dataflow-graph optimization generates a scheduling strategy;
the propagation kernels on the GPU perform efficient propagation operations;
the multi-GPU parallel processing performs parallel computation across multiple GPUs within one server.
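For orientation, the dataflow role of these components can be emulated with stock TensorFlow ops, as in the hedged sketch below. The patent's custom operators stand in for exactly the gather and segment-sum steps (replacing them with optimized GPU kernels); the specific UDF_Edge formula and all tensor shapes here are assumptions made for illustration, not the patent's published code.

```python
import tensorflow as tf

# Hedged sketch: emulate the Separation / UDF_Edge / Aggregation dataflow
# with stock TensorFlow ops. The patent replaces the gather and
# segment-sum steps below with custom GPU operators.
num_vertices = 1000
vertex_data = tf.random.normal([num_vertices, 64])  # per-vertex features
src_ids = tf.random.uniform([10000], maxval=num_vertices, dtype=tf.int32)
dst_ids = tf.random.uniform([10000], maxval=num_vertices, dtype=tf.int32)

# Separation: propagate vertex tensors onto their adjacent edges.
edge_src = tf.gather(vertex_data, src_ids)
edge_dst = tf.gather(vertex_data, dst_ids)

# UDF_Edge: a per-edge computation composed from framework ops
# (the formula is an assumed example).
edge_out = tf.tanh(edge_src + edge_dst)

# Aggregation: reduce per-edge results onto their target vertices.
accum = tf.math.unsorted_segment_sum(edge_out, dst_ids, num_vertices)
```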
Preferably, the block-based dataflow transformation further comprises a front end that transforms an algorithm implemented in the SUAU-NN model into a chunk-granularity dataflow graph, enabling GNN computation on a large graph on the GPU. SUAU-NN defines four feedforward computation stages in each GNN layer: Separation, UDF_Edge, Aggregation, and UDF_Vertex. SUAU-NN provides two user-defined functions (UDFs), UDF_Edge and UDF_Vertex, in which the user declares the neural network computations on edges and vertices respectively. The UDF_Edge function defines the computation on each edge; it takes as input edge, an abstraction of the edge data, and p, which contains the learnable parameters of the GNN model. Each edge is a tuple of tensors [src, dest, data], representing the data of the source and target vertices connected by the edge together with the data associated with the edge itself (e.g., an edge weight). The function may apply a neural network model to edge and p, and outputs an intermediate tensor value associated with the edge. The UDF_Vertex function defines the computation on a vertex; it takes as input the vertex tensor data vertex, a vertex aggregation accumulator accum, and the learnable parameters p, and returns new vertex data by applying a neural network model. The SUAU-NN abstraction is built on top of the dataflow framework, so users can symbolically define dataflow graphs inside the UDFs by composing the mathematical operations (e.g., add, tanh, softmax, matmul) provided by the underlying framework. The other two stages, Separation and Aggregation, perform data propagation and prepare the data collections that feed UDF_Edge and UDF_Vertex; they are triggered and executed implicitly by the system, without the user providing explicit UDFs (an illustrative Python sketch of the two user-defined functions appears after this list of preferred features);
preferably, the optimization of the dataflow graph further comprises minimizing data exchange between host and GPU device memory, maximizing the reuse of data chunks in GPU memory while overlapping data movement and computation in a streaming fashion, and identifying opportunities for operator fusion and the elimination of redundant computation;
preferably, the propagation kernels on the GPU further comprise support for stream-based processing, overlapping data movement and computation on the GPU. They include a scatter kernel (the Separation kernel) and an Aggregation kernel. The Separation operator propagates vertex feature data from source and target vertices onto their edges. The Aggregation operator collects the edge feature vectors of each target vertex through a user-provided accumulation function and reduces them to a single vector per target vertex;
preferably, the multi-GPU parallel processing further comprises connecting to the CPU/DRAM (host memory) through a multi-level PCI Express (PCIe) interface hierarchy to realize the multi-GPU interconnect, and avoiding redundant data movement from host memory by exchanging data chunks directly between GPUs using a ring-based streaming mechanism.
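The two user-defined functions can be sketched as follows; this is a minimal illustration assuming NumPy tensors, with invented parameter names and formulas, since the patent specifies the UDF roles and signatures but not a concrete API.

```python
import numpy as np

def udf_edge(edge, p):
    """UDF_Edge: the computation on each edge. edge is the tuple of
    tensors [src, dest, data] described above; p holds the learnable GNN
    parameters. Returns an intermediate tensor value for the edge.
    (The tanh formula is an assumed example.)"""
    src, dest, data = edge
    return np.tanh(src @ p["W_src"] + dest @ p["W_dst"]) * data

def udf_vertex(vertex, accum, p):
    """UDF_Vertex: the computation on a vertex. vertex is the vertex
    tensor data, accum the vertex aggregation accumulator, and p the
    learnable parameters. Returns the new vertex data."""
    return np.tanh(np.concatenate([vertex, accum]) @ p["W_v"])
```

The Separation and Aggregation stages that feed these two functions are triggered implicitly by the system, so user code supplies only the pair above.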
After adopting the above technical scheme, compared with the prior art, the invention has the following advantages:
First, it supports large-scale graph-based neural network model computation; second, it supports many different types of graph-based neural networks; third, it provides a new program abstraction that is mapped and optimized into a dataflow graph and can be executed efficiently on the GPU; fourth, it supports efficient and scalable parallel training of graph neural networks.
Drawings
FIG. 1(a) shows the four-stage execution flow of the feedforward computation of the SUAU-NN model, and FIG. 1(b) shows the hierarchically optimized dataflow graph of G-GCN; FIG. 2 is a diagram of the multi-GPU architecture.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
Referring to FIGS. 1 and 2, the invention discloses a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, comprising: block-based dataflow transformation, dataflow-graph optimization, propagation kernels on the GPU, and multi-GPU parallel processing. The large-scale graph deep learning is implemented as follows:
1) Block-based dataflow transformation: to achieve scalability beyond the physical limits of a GPU, the framework divides the graph data, with its attached vertex and edge data, into chunks, and transforms the GNN algorithm expressed in the model into a dataflow graph with chunk-granularity operators, which allows chunk-based parallel stream processing on a single GPU or on multiple GPUs. First, the Separation stage transfers the tensor data vertex of each vertex to its adjacent edges to form edge data that includes the source and target vertex data. The subsequent UDF_Edge stage invokes the parallel computation defined by UDF_Edge on the edge data, generating an intermediate tensor value for each edge as output. The Aggregation stage then propagates these outputs along the edges and aggregates them at the target vertices through commutative and associative accumulation operations. Finally, the UDF_Vertex stage performs the computation defined in the UDF_Vertex function on all vertices to generate the updated vertex data for the next layer. For each GNN layer, the four-stage execution flow of the feedforward computation of the SUAU-NN model is shown in FIG. 1(a).
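A minimal NumPy sketch of this four-stage flow for one layer follows; the concrete UDF formulas, names, and the sum accumulator are assumptions for illustration, and the real framework runs the same flow as a chunk-granularity dataflow graph on GPUs.

```python
import numpy as np

def gnn_layer(vertex_data, src_ids, dst_ids, edge_weights, p):
    """One SUAU-NN layer: Separation -> UDF_Edge -> Aggregation -> UDF_Vertex."""
    num_vertices, dim = vertex_data.shape

    # Separation: transfer each vertex's tensor data to its adjacent
    # edges, forming edge data with source and target vertex data.
    edge_src = vertex_data[src_ids]                      # (E, dim)
    edge_dst = vertex_data[dst_ids]                      # (E, dim)

    # UDF_Edge: parallel per-edge computation yielding an intermediate
    # tensor value per edge (assumed example formula).
    edge_out = np.tanh((edge_src + edge_dst) @ p["W_e"]) * edge_weights[:, None]

    # Aggregation: propagate edge outputs along the edges and reduce them
    # onto target vertices with a commutative, associative accumulation.
    accum = np.zeros((num_vertices, dim))
    np.add.at(accum, dst_ids, edge_out)                  # scatter-add

    # UDF_Vertex: per-vertex computation producing next-layer vertex data.
    return np.tanh((vertex_data + accum) @ p["W_v"])

rng = np.random.default_rng(0)
p = {"W_e": rng.standard_normal((64, 64)) * 0.1,
     "W_v": rng.standard_normal((64, 64)) * 0.1}
x = gnn_layer(rng.standard_normal((1000, 64)),
              rng.integers(0, 1000, 10000), rng.integers(0, 1000, 10000),
              rng.standard_normal(10000), p)
```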
2) Optimization of the dataflow graph: since a vertex may have multiple adjacent edges, its data is replicated across those edges, so performing the same multiplication once per edge of a vertex is redundant. The optimization therefore moves computations that involve only source or target vertex data from the UDF_Edge phase of the current layer to the UDF_Vertex phase of the previous layer, where they are performed once per vertex. FIG. 1(b) shows the optimized dataflow graph, with the matrix multiplication moved to the UDF_Vertex phase.
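The redundancy being removed can be shown concretely: if UDF_Edge multiplies source-vertex data by a weight matrix, a vertex with k adjacent edges repeats the identical product k times, whereas hoisting it to the vertex phase computes it once per vertex. A small NumPy sketch of the equivalence (all names assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, D = 1000, 10000, 64
vertex_data = rng.standard_normal((V, D))
W = rng.standard_normal((D, D))
src_ids = rng.integers(0, V, E)

# Before: the matrix multiplication runs once per edge, so a vertex with
# many adjacent edges recomputes the same product repeatedly.
edge_out_naive = vertex_data[src_ids] @ W        # E multiplications

# After: hoist the multiplication to the previous layer's UDF_Vertex
# phase, computing once per vertex and only gathering along edges.
vertex_proj = vertex_data @ W                    # V multiplications
edge_out_opt = vertex_proj[src_ids]

assert np.allclose(edge_out_naive, edge_out_opt)
```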
3) Propagation kernels on the GPU: Separation and Aggregation are the two key stages of the propagation process on sparse graph structures. To support these two stages efficiently on the GPU, customized Separation and Aggregation kernels perform optimized propagation. Their design takes the data-structure layout into account so that the kernels can better exploit the GPU memory hierarchy (e.g., shared memory and registers) and massive parallelism. The Separation operator propagates vertex feature data from source and target vertices onto their edges. In the feedforward computation, the edges are arranged in CSC (compressed sparse column) format, so the incoming edges of a single target vertex are organized into one group, and a thread block is assigned to process them. For vertices with large degrees, the edge group is partitioned into contiguous subgroups across thread blocks. Within each thread block, the Separation of the source vertex data proceeds in two phases. First, the threads fetch the source vertex ids of the edge group, which are stored in adjacent address space, and cache them in shared memory. Second, the threads copy the vertex data into the edge data memory in parallel along the dimensions of the vertex feature vector, giving the threads efficient coalesced memory accesses. The Aggregation operator collects the edge feature vectors of each target vertex through a user-provided accumulation function and reduces them to a single vector per target vertex: a thread block first cooperatively enumerates its edge group, accumulating the features of each edge into a temporary vector held in registers, and then writes the result back to the corresponding target vertex.
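These two kernels can be sketched in NumPy as follows, using the usual indptr/indices naming for the CSC arrays; the sketch only mirrors the grouping logic, while thread-block tiling, shared-memory staging of source ids, and register accumulation are GPU details not representable here.

```python
import numpy as np

def separation_csc(vertex_data, src_indices):
    """Separation: copy source-vertex data onto each incoming edge.
    src_indices[e] is the source vertex id of edge e, with the incoming
    edges of each target vertex stored contiguously (CSC order). On the
    GPU, a thread block stages its group's source ids in shared memory,
    then copies feature vectors with coalesced accesses; here both steps
    collapse into one gather."""
    return vertex_data[src_indices]

def aggregation_csc(edge_features, indptr, num_vertices):
    """Aggregation: reduce per-edge feature vectors onto target vertices.
    The incoming edges of vertex v occupy the range indptr[v]:indptr[v+1].
    Sum stands in for the user-provided accumulation function; on the GPU
    each thread block accumulates its edge group into a register-resident
    temporary vector, then writes it back to the target vertex once."""
    out = np.zeros((num_vertices, edge_features.shape[1]),
                   dtype=edge_features.dtype)
    for v in range(num_vertices):        # one "thread block" per vertex
        lo, hi = indptr[v], indptr[v + 1]
        if lo < hi:
            out[v] = edge_features[lo:hi].sum(axis=0)
    return out
```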
4) Multi-GPU parallel processing: in a multi-GPU system the GPU interconnect has a hierarchical structure, so transferring data between different GPUs, or between the host and GPU device memory, exhibits multiple levels of locality. As shown in FIG. 2, GPU 0 and GPU 1 are attached to the same PCIe switch and therefore have the highest communication performance between them. Taking the locality characteristics of a multi-GPU system into account, a streaming scheme is carefully designed to achieve higher parallelism and better performance. A ring-based streaming scheme is designed as well, allowing the GPUs to reuse data chunks already loaded from host memory by exchanging them directly; the data-sharing paths between the GPUs are organized into a ring, shown as dashed lines in FIG. 2.
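A toy simulation of the ring-based exchange follows (pure Python, hypothetical names): each of G workers loads one distinct chunk from host memory once, then the chunks circulate around the ring for G steps, so every GPU processes every chunk while each chunk crosses the host-to-device boundary exactly once.

```python
def ring_stream(chunks, process):
    """Simulate ring-based streaming across len(chunks) GPUs.
    process(gpu, chunk) stands for the per-GPU computation on one chunk."""
    num_gpus = len(chunks)
    held = list(chunks)      # step 0: each GPU loads one chunk from host
    for _ in range(num_gpus):
        for gpu in range(num_gpus):
            process(gpu, held[gpu])
        # Rotate: every GPU forwards its chunk to the next GPU in the
        # ring, so no further host-memory transfers are needed.
        held = [held[(gpu + 1) % num_gpus] for gpu in range(num_gpus)]

log = []
ring_stream([f"chunk{i}" for i in range(4)],
            lambda gpu, chunk: log.append((gpu, chunk)))
assert len({c for g, c in log if g == 0}) == 4   # GPU 0 saw every chunk
```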
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, comprising: block-based dataflow transformation, dataflow-graph optimization, propagation kernels on the GPU, and multi-GPU parallel processing, wherein:
the block-based dataflow transformation divides the vertex and edge data of a graph into chunks small enough to be loaded into GPU device memory, and constructs a dataflow graph together with operators that process computation at chunk granularity;
the dataflow-graph optimization generates a scheduling strategy;
the propagation kernels on the GPU perform efficient propagation operations;
the multi-GPU parallel processing performs parallel computation across multiple GPUs within one server.
2. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the block-based dataflow transformation further comprises a front end that transforms an algorithm implemented in the SUAU-NN model into a chunk-granularity dataflow graph, making GNN computation on a large graph possible on the GPU; SUAU-NN defines four feedforward computation stages in each GNN layer, namely Separation, UDF_Edge, Aggregation, and UDF_Vertex; and SUAU-NN provides two user-defined functions (UDFs), UDF_Edge and UDF_Vertex, in which the computations on the edges and vertices of the neural network are declared.
3. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the optimization of the dataflow graph further comprises minimizing data exchange between host and GPU device memory, maximizing the reuse of data chunks in GPU memory while overlapping data movement and computation in a streaming fashion, and identifying opportunities for operator fusion and the elimination of redundant computation.
4. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the propagation kernels on the GPU further comprise support for stream-based processing, overlapping data movement and computation on the GPU;
they include a scatter kernel (the Separation kernel) and an Aggregation kernel;
the Separation operator propagates vertex feature data from source and target vertices onto their edges, and the Aggregation operator collects the edge feature vectors of each target vertex through a user-provided accumulation function and reduces them to a single vector per target vertex.
5. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the multi-GPU parallel processing further comprises connecting to the CPU/DRAM (host memory) through a multi-level PCI Express (PCIe) interface hierarchy to realize the multi-GPU interconnect, and using a ring-based streaming mechanism to avoid redundant data movement from host memory by exchanging data chunks directly between GPUs.
CN202010318058.6A 2020-04-21 2020-04-21 Large-scale graph deep learning computation framework based on hierarchical optimization paradigm Pending CN111652346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010318058.6A 2020-04-21 2020-04-21 Large-scale graph deep learning computation framework based on hierarchical optimization paradigm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010318058.6A 2020-04-21 2020-04-21 Large-scale graph deep learning computation framework based on hierarchical optimization paradigm

Publications (1)

Publication Number Publication Date
CN111652346A 2020-09-11

Family

ID=72352514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010318058.6A Large-scale graph deep learning computation framework based on hierarchical optimization paradigm 2020-04-21 2020-04-21

Country Status (1)

Country Link
CN (1) CN111652346A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108639A1 (en) * 2017-10-09 2019-04-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Semantic Segmentation of 3D Point Clouds
CN110941494A (en) * 2019-12-02 2020-03-31 哈尔滨工程大学 Deep learning-oriented GPU parallel computing data processing method

Non-Patent Citations (2)

Title
LINGXIAO MA et al.: "Towards Efficient Large-Scale Graph Neural Network Computing" *
QI JINSHAN; LIANG XUN; LI ZHIYU; CHEN YANFANG; XU YUAN: "Large-scale complex information network representation learning: concepts, methods and challenges" (大规模复杂信息网络表示学习：概念、方法与挑战) *

Cited By (1)

Publication number Priority date Publication date Assignee Title
WO2022057310A1 (en) * 2020-09-15 2022-03-24 华为技术有限公司 Method, apparatus and system for training graph neural network

Similar Documents

Publication Publication Date Title
Zhang et al. BoostGCN: A framework for optimizing GCN inference on FPGA
Kiningham et al. GRIP: A graph neural network accelerator architecture
Dafir et al. A survey on parallel clustering algorithms for big data
Chen et al. Rubik: A hierarchical architecture for efficient graph neural network training
Chen et al. On-the-fly parallel data shuffling for graph processing on OpenCL-based FPGAs
Sun et al. Lightweight image classifier using dilated and depthwise separable convolutions
Pascucci et al. The ViSUS Visualization Framework.
KR20220034236A (en) Accelerated embedding layer computations
Zhang et al. A survey on graph neural network acceleration: Algorithms, systems, and customized hardware
Jiang et al. Improving the Performance of Whale Optimization Algorithm through OpenCL‐Based FPGA Accelerator
Shi et al. Versagnn: a versatile accelerator for graph neural networks
Chen et al. Rubik: A hierarchical architecture for efficient graph learning
Zhang et al. Efficient neighbor-sampling-based gnn training on cpu-fpga heterogeneous platform
Xia et al. Redundancy-free high-performance dynamic GNN training with hierarchical pipeline parallelism
CN116401502A (en) Method and device for optimizing Winograd convolution based on NUMA system characteristics
Garg et al. Understanding the design space of sparse/dense multiphase dataflows for mapping graph neural networks on spatial accelerators
CN113452655A (en) Distributed training method, gradient communication device and computing equipment
Zhou et al. FASTCF: FPGA-based accelerator for stochastic-gradient-descent-based collaborative filtering
Jiao et al. PGLBox: Multi-GPU Graph Learning Framework for Web-Scale Recommendation
CN111652346A (en) Large-scale graph deep learning computation framework based on hierarchical optimization paradigm
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
Chen et al. Point Cloud Acceleration by Exploiting Geometric Similarity
US11921784B2 (en) Flexible, scalable graph-processing accelerator
Lee et al. MVP: An Efficient CNN Accelerator with Matrix, Vector, and Processing-Near-Memory Units
Li et al. Spnet: Multi-shell kernel convolution for point cloud semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200911