
CN111652346A - Large-scale graph deep learning computation framework based on a hierarchical optimization paradigm - Google Patents

Large-scale graph deep learning computation framework based on a hierarchical optimization paradigm

Info

Publication number
CN111652346A
CN111652346A (application CN202010318058.6A)
Authority
CN
China
Prior art keywords
vertex
data
gpu
graph
edge
Prior art date
Legal status
Pending
Application number
CN202010318058.6A
Other languages
Chinese (zh)
Inventor
王彬 (Wang Bin)
洪万福 (Hong Wanfu)
钱智毅 (Qian Zhiyi)
Current Assignee
Xiamen Yuanting Information Technology Co., Ltd.
Original Assignee
Xiamen Yuanting Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Xiamen Yuanting Information Technology Co., Ltd.
Priority to CN202010318058.6A
Publication of CN111652346A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Generation (AREA)

Abstract

The invention provides a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, and relates to the technical field of deep learning. The framework comprises: block-based dataflow transformation, which partitions the vertex and edge data of a graph into chunks; dataflow-graph optimization, which generates a scheduling strategy; propagation kernels on the GPU, which perform efficient propagation operations; and multi-GPU parallel processing, which performs parallel computation across the GPUs of one server. The invention supports large-scale graph neural networks (GNNs), allows models to be expressed simply, and provides a scalable, efficient multi-GPU parallel processing engine. To represent the recursive computation of each GNN layer, which includes both graph propagation and deep neural network (DNN) computation, the invention adopts the Separation-UDF_Edge-Aggregation-UDF_Vertex with Neural Networks (SUAU-NN) vertex-program abstraction.

Description

Large-scale graph deep learning computation framework based on a hierarchical optimization paradigm
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm.
Background
Deep learning in the form of deep neural networks has been successfully applied in many fields, such as speech, vision, and natural language processing. In these fields, the underlying data typically have a regular grid structure, which lends itself to hardware acceleration (e.g., on GPUs) over large amounts of data in parallel. Driven by the importance of graph data in social networks, knowledge graphs, bioinformatics, and neuroscience, applying deep learning models to data with irregular graph structure is an emerging trend, with state-of-the-art prediction results in target applications such as classification, embedding, and question-answering systems. These graph-based neural networks typically apply a neural network model to the features associated with the vertices and edges of the graph, then propagate and aggregate the results to produce the next layer's features.
However, no existing solution supports GNNs well. Existing graph processing engines typically provide a Gather-Apply-Scatter (GAS) vertex-program model, which cannot express or support the neural network architectures used on graph structures. Deep learning frameworks such as TensorFlow, PyTorch, MXNet, and CNTK express neural networks as dataflow graphs, but do not directly support a graph propagation model. Furthermore, none of them provide the scalability required to process large-scale graphs, nor do they support efficient GPU-based implementations of graph propagation operators. This lack of support severely limits the ability to exploit the full potential of large-scale GNNs, and poses a significant challenge for integrating DNNs with large graph structures at the system level.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, which supports large-scale GNNs, allows models to be expressed simply, and provides a scalable and efficient multi-GPU parallel processing engine. The invention combines the dataflow and vertex-program abstractions into a new model, in which neural network computations over vertex or edge data (handled in tensor form) are expressed through the dataflow abstraction, rather than through abstractions designed for traditional graph processing (e.g., PageRank, connected components, or shortest-path algorithms). To represent the recursive computation of each GNN layer, which includes graph propagation and DNN computation, the invention adopts the SUAU-NN vertex-program abstraction.
The present invention introduces special operators backed by a graph propagation engine into the dataflow graph and optimizes their execution on the GPU. The scheme therefore improves memory-access efficiency by exploiting the parallelism within each vertex's data accesses.
The technical scheme adopted by the invention is as follows:
a large-scale graph deep learning calculation framework based on a hierarchical optimization paradigm is realized by using a custom operator extension TensorFlow of a vertex program abstraction and a graph propagation process. Compared with TensorFlow supported by a GPU on a small graph, the speed of the method is improved by about 4 times. It includes: block-based data stream conversion, optimization of a data stream graph, propagation of kernels on GPUs, and parallel processing of multiple GPUs, wherein:
the block-based dataflow transformation divides the vertex and edge data of a graph into chunks small enough to be loaded into GPU device memory, and constructs a dataflow graph together with operators that process computation at chunk granularity;
the dataflow-graph optimization generates a scheduling strategy;
the propagation kernels on the GPU perform efficient propagation operations;
the multi-GPU parallel processing performs parallel computation across multiple GPUs within one server.
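For orientation, the dataflow role of these components can be emulated with stock TensorFlow ops, as in the hedged sketch below. The patent's custom operators stand in for exactly the gather and segment-sum steps (replacing them with optimized GPU kernels); the specific UDF_Edge formula and all tensor shapes here are assumptions made for illustration, not the patent's published code.

```python
import tensorflow as tf

# Hedged sketch: emulate the Separation / UDF_Edge / Aggregation dataflow
# with stock TensorFlow ops. The patent replaces the gather and
# segment-sum steps below with custom GPU operators.
num_vertices = 1000
vertex_data = tf.random.normal([num_vertices, 64])  # per-vertex features
src_ids = tf.random.uniform([10000], maxval=num_vertices, dtype=tf.int32)
dst_ids = tf.random.uniform([10000], maxval=num_vertices, dtype=tf.int32)

# Separation: propagate vertex tensors onto their adjacent edges.
edge_src = tf.gather(vertex_data, src_ids)
edge_dst = tf.gather(vertex_data, dst_ids)

# UDF_Edge: a per-edge computation composed from framework ops
# (the formula is an assumed example).
edge_out = tf.tanh(edge_src + edge_dst)

# Aggregation: reduce per-edge results onto their target vertices.
accum = tf.math.unsorted_segment_sum(edge_out, dst_ids, num_vertices)
```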
Preferably, the block-based dataflow transformation further comprises a front end that transforms an algorithm implemented in the SUAU-NN model into a chunk-granularity dataflow graph, enabling GNN computation on a large graph on the GPU. SUAU-NN defines four feedforward computation stages in each GNN layer: Separation, UDF_Edge, Aggregation, and UDF_Vertex. SUAU-NN provides two user-defined functions (UDFs), UDF_Edge and UDF_Vertex, in which the user declares the neural network computations on edges and vertices respectively. The UDF_Edge function defines the computation on each edge; it takes as input edge, an abstraction of the edge data, and p, which contains the learnable parameters of the GNN model. Each edge is a tuple of tensors [src, dest, data], representing the data of the source and target vertices connected by the edge together with the data associated with the edge itself (e.g., an edge weight). The function may apply a neural network model to edge and p, and outputs an intermediate tensor value associated with the edge. The UDF_Vertex function defines the computation on a vertex; it takes as input the vertex tensor data vertex, a vertex aggregation accumulator accum, and the learnable parameters p, and returns new vertex data by applying a neural network model. The SUAU-NN abstraction is built on top of the dataflow framework, so users can symbolically define dataflow graphs inside the UDFs by composing the mathematical operations (e.g., add, tanh, softmax, matmul) provided by the underlying framework. The other two stages, Separation and Aggregation, perform data propagation and prepare the data collections that feed UDF_Edge and UDF_Vertex; they are triggered and executed implicitly by the system, without the user providing explicit UDFs (an illustrative Python sketch of the two user-defined functions appears after this list of preferred features);
preferably, the optimization of the dataflow graph further comprises minimizing data exchange between host and GPU device memory, maximizing the reuse of data chunks in GPU memory while overlapping data movement and computation in a streaming fashion, and identifying opportunities for operator fusion and the elimination of redundant computation;
preferably, the propagation kernels on the GPU further comprise support for stream-based processing, overlapping data movement and computation on the GPU. They include a scatter kernel (the Separation kernel) and an Aggregation kernel. The Separation operator propagates vertex feature data from source and target vertices onto their edges. The Aggregation operator collects the edge feature vectors of each target vertex through a user-provided accumulation function and reduces them to a single vector per target vertex;
preferably, the multi-GPU parallel processing further comprises connecting to the CPU/DRAM (host memory) through a multi-level PCI Express (PCIe) interface hierarchy to realize the multi-GPU interconnect, and avoiding redundant data movement from host memory by exchanging data chunks directly between GPUs using a ring-based streaming mechanism.
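The two user-defined functions can be sketched as follows; this is a minimal illustration assuming NumPy tensors, with invented parameter names and formulas, since the patent specifies the UDF roles and signatures but not a concrete API.

```python
import numpy as np

def udf_edge(edge, p):
    """UDF_Edge: the computation on each edge. edge is the tuple of
    tensors [src, dest, data] described above; p holds the learnable GNN
    parameters. Returns an intermediate tensor value for the edge.
    (The tanh formula is an assumed example.)"""
    src, dest, data = edge
    return np.tanh(src @ p["W_src"] + dest @ p["W_dst"]) * data

def udf_vertex(vertex, accum, p):
    """UDF_Vertex: the computation on a vertex. vertex is the vertex
    tensor data, accum the vertex aggregation accumulator, and p the
    learnable parameters. Returns the new vertex data."""
    return np.tanh(np.concatenate([vertex, accum]) @ p["W_v"])
```

The Separation and Aggregation stages that feed these two functions are triggered implicitly by the system, so user code supplies only the pair above.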
After adopting the above technical scheme, compared with the prior art, the invention has the following advantages:
First, it supports large-scale graph-based neural network model computation; second, it supports many different types of graph-based neural networks; third, it provides a new program abstraction that is mapped and optimized into a dataflow graph and can be executed efficiently on the GPU; fourth, it supports efficient and scalable parallel training of graph neural networks.
Drawings
FIG. 1(a) shows the four-stage execution flow of the feedforward computation of the SUAU-NN model, and FIG. 1(b) shows the hierarchically optimized dataflow graph of G-GCN; FIG. 2 is a diagram of the multi-GPU architecture.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
Referring to FIGS. 1 and 2, the invention discloses a large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, comprising: block-based dataflow transformation, dataflow-graph optimization, propagation kernels on the GPU, and multi-GPU parallel processing. The large-scale graph deep learning is implemented as follows:
1) Block-based dataflow transformation: to achieve scalability beyond the physical limits of a GPU, the framework divides the graph data, with its attached vertex and edge data, into chunks, and transforms the GNN algorithm expressed in the model into a dataflow graph with chunk-granularity operators, which allows chunk-based parallel stream processing on a single GPU or on multiple GPUs. First, the Separation stage transfers the tensor data vertex of each vertex to its adjacent edges to form edge data that includes the source and target vertex data. The subsequent UDF_Edge stage invokes the parallel computation defined by UDF_Edge on the edge data, generating an intermediate tensor value for each edge as output. The Aggregation stage then propagates these outputs along the edges and aggregates them at the target vertices through commutative and associative accumulation operations. Finally, the UDF_Vertex stage performs the computation defined in the UDF_Vertex function on all vertices to generate the updated vertex data for the next layer. For each GNN layer, the four-stage execution flow of the feedforward computation of the SUAU-NN model is shown in FIG. 1(a).
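A minimal NumPy sketch of this four-stage flow for one layer follows; the concrete UDF formulas, names, and the sum accumulator are assumptions for illustration, and the real framework runs the same flow as a chunk-granularity dataflow graph on GPUs.

```python
import numpy as np

def gnn_layer(vertex_data, src_ids, dst_ids, edge_weights, p):
    """One SUAU-NN layer: Separation -> UDF_Edge -> Aggregation -> UDF_Vertex."""
    num_vertices, dim = vertex_data.shape

    # Separation: transfer each vertex's tensor data to its adjacent
    # edges, forming edge data with source and target vertex data.
    edge_src = vertex_data[src_ids]                      # (E, dim)
    edge_dst = vertex_data[dst_ids]                      # (E, dim)

    # UDF_Edge: parallel per-edge computation yielding an intermediate
    # tensor value per edge (assumed example formula).
    edge_out = np.tanh((edge_src + edge_dst) @ p["W_e"]) * edge_weights[:, None]

    # Aggregation: propagate edge outputs along the edges and reduce them
    # onto target vertices with a commutative, associative accumulation.
    accum = np.zeros((num_vertices, dim))
    np.add.at(accum, dst_ids, edge_out)                  # scatter-add

    # UDF_Vertex: per-vertex computation producing next-layer vertex data.
    return np.tanh((vertex_data + accum) @ p["W_v"])

rng = np.random.default_rng(0)
p = {"W_e": rng.standard_normal((64, 64)) * 0.1,
     "W_v": rng.standard_normal((64, 64)) * 0.1}
x = gnn_layer(rng.standard_normal((1000, 64)),
              rng.integers(0, 1000, 10000), rng.integers(0, 1000, 10000),
              rng.standard_normal(10000), p)
```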
2) Optimization of the dataflow graph: since a vertex may have multiple adjacent edges, its data is replicated across those edges, so performing the same multiplication once per edge of a vertex is redundant. The optimization therefore moves computations that involve only source or target vertex data from the UDF_Edge phase of the current layer to the UDF_Vertex phase of the previous layer, where they are performed once per vertex. FIG. 1(b) shows the optimized dataflow graph, with the matrix multiplication moved to the UDF_Vertex phase.
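The redundancy being removed can be shown concretely: if UDF_Edge multiplies source-vertex data by a weight matrix, a vertex with k adjacent edges repeats the identical product k times, whereas hoisting it to the vertex phase computes it once per vertex. A small NumPy sketch of the equivalence (all names assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, D = 1000, 10000, 64
vertex_data = rng.standard_normal((V, D))
W = rng.standard_normal((D, D))
src_ids = rng.integers(0, V, E)

# Before: the matrix multiplication runs once per edge, so a vertex with
# many adjacent edges recomputes the same product repeatedly.
edge_out_naive = vertex_data[src_ids] @ W        # E multiplications

# After: hoist the multiplication to the previous layer's UDF_Vertex
# phase, computing once per vertex and only gathering along edges.
vertex_proj = vertex_data @ W                    # V multiplications
edge_out_opt = vertex_proj[src_ids]

assert np.allclose(edge_out_naive, edge_out_opt)
```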
3) Propagation kernels on the GPU: Separation and Aggregation are the two key stages of the propagation process on sparse graph structures. To support these two stages efficiently on the GPU, customized Separation and Aggregation kernels perform optimized propagation. Their design takes the data-structure layout into account so that the kernels can better exploit the GPU memory hierarchy (e.g., shared memory and registers) and massive parallelism. The Separation operator propagates vertex feature data from source and target vertices onto their edges. In the feedforward computation, the edges are arranged in CSC (compressed sparse column) format, so the incoming edges of a single target vertex are organized into one group, and a thread block is assigned to process them. For vertices with large degrees, the edge group is partitioned into contiguous subgroups across thread blocks. Within each thread block, the Separation of the source vertex data proceeds in two phases. First, the threads fetch the source vertex ids of the edge group, which are stored in adjacent address space, and cache them in shared memory. Second, the threads copy the vertex data into the edge data memory in parallel along the dimensions of the vertex feature vector, giving the threads efficient coalesced memory accesses. The Aggregation operator collects the edge feature vectors of each target vertex through a user-provided accumulation function and reduces them to a single vector per target vertex: a thread block first cooperatively enumerates its edge group, accumulating the features of each edge into a temporary vector held in registers, and then writes the result back to the corresponding target vertex.
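These two kernels can be sketched in NumPy as follows, using the usual indptr/indices naming for the CSC arrays; the sketch only mirrors the grouping logic, while thread-block tiling, shared-memory staging of source ids, and register accumulation are GPU details not representable here.

```python
import numpy as np

def separation_csc(vertex_data, src_indices):
    """Separation: copy source-vertex data onto each incoming edge.
    src_indices[e] is the source vertex id of edge e, with the incoming
    edges of each target vertex stored contiguously (CSC order). On the
    GPU, a thread block stages its group's source ids in shared memory,
    then copies feature vectors with coalesced accesses; here both steps
    collapse into one gather."""
    return vertex_data[src_indices]

def aggregation_csc(edge_features, indptr, num_vertices):
    """Aggregation: reduce per-edge feature vectors onto target vertices.
    The incoming edges of vertex v occupy the range indptr[v]:indptr[v+1].
    Sum stands in for the user-provided accumulation function; on the GPU
    each thread block accumulates its edge group into a register-resident
    temporary vector, then writes it back to the target vertex once."""
    out = np.zeros((num_vertices, edge_features.shape[1]),
                   dtype=edge_features.dtype)
    for v in range(num_vertices):        # one "thread block" per vertex
        lo, hi = indptr[v], indptr[v + 1]
        if lo < hi:
            out[v] = edge_features[lo:hi].sum(axis=0)
    return out
```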
4) Multi-GPU parallel processing: in a multi-GPU system the GPU interconnect has a hierarchical structure, so transferring data between different GPUs, or between the host and GPU device memory, exhibits multiple levels of locality. As shown in FIG. 2, GPU 0 and GPU 1 are attached to the same PCIe switch and therefore have the highest communication performance between them. Taking the locality characteristics of a multi-GPU system into account, a streaming scheme is carefully designed to achieve higher parallelism and better performance. A ring-based streaming scheme is designed as well, allowing the GPUs to reuse data chunks already loaded from host memory by exchanging them directly; the data-sharing paths between the GPUs are organized into a ring, shown as dashed lines in FIG. 2.
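A toy simulation of the ring-based exchange follows (pure Python, hypothetical names): each of G workers loads one distinct chunk from host memory once, then the chunks circulate around the ring for G steps, so every GPU processes every chunk while each chunk crosses the host-to-device boundary exactly once.

```python
def ring_stream(chunks, process):
    """Simulate ring-based streaming across len(chunks) GPUs.
    process(gpu, chunk) stands for the per-GPU computation on one chunk."""
    num_gpus = len(chunks)
    held = list(chunks)      # step 0: each GPU loads one chunk from host
    for _ in range(num_gpus):
        for gpu in range(num_gpus):
            process(gpu, held[gpu])
        # Rotate: every GPU forwards its chunk to the next GPU in the
        # ring, so no further host-memory transfers are needed.
        held = [held[(gpu + 1) % num_gpus] for gpu in range(num_gpus)]

log = []
ring_stream([f"chunk{i}" for i in range(4)],
            lambda gpu, chunk: log.append((gpu, chunk)))
assert len({c for g, c in log if g == 0}) == 4   # GPU 0 saw every chunk
```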
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A large-scale graph deep learning computation framework based on a hierarchical optimization paradigm, comprising: block-based dataflow transformation, dataflow-graph optimization, propagation kernels on the GPU, and multi-GPU parallel processing, wherein:
the block-based dataflow transformation divides the vertex and edge data of a graph into chunks small enough to be loaded into GPU device memory, and constructs a dataflow graph together with operators that process computation at chunk granularity;
the dataflow-graph optimization generates a scheduling strategy;
the propagation kernels on the GPU perform efficient propagation operations;
the multi-GPU parallel processing performs parallel computation across multiple GPUs within one server.
2. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the block-based dataflow transformation further comprises a front end that transforms an algorithm implemented in the SUAU-NN model into a chunk-granularity dataflow graph, making GNN computation on a large graph possible on the GPU; SUAU-NN defines four feedforward computation stages in each GNN layer, namely Separation, UDF_Edge, Aggregation, and UDF_Vertex; and SUAU-NN provides two user-defined functions (UDFs), UDF_Edge and UDF_Vertex, in which the computations on the edges and vertices of the neural network are declared.
3. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the optimization of the dataflow graph further comprises minimizing data exchange between host and GPU device memory, maximizing the reuse of data chunks in GPU memory while overlapping data movement and computation in a streaming fashion, and identifying opportunities for operator fusion and the elimination of redundant computation.
4. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the propagation kernels on the GPU further comprise support for stream-based processing, overlapping data movement and computation on the GPU;
they include a scatter kernel (the Separation kernel) and an Aggregation kernel;
the Separation operator propagates vertex feature data from source and target vertices onto their edges, and the Aggregation operator collects the edge feature vectors of each target vertex through a user-provided accumulation function and reduces them to a single vector per target vertex.
5. The large-scale graph deep learning computation framework based on a hierarchical optimization paradigm according to claim 1, characterized in that: the multi-GPU parallel processing further comprises connecting to the CPU/DRAM (host memory) through a multi-level PCI Express (PCIe) interface hierarchy to realize the multi-GPU interconnect, and using a ring-based streaming mechanism to avoid redundant data movement from host memory by exchanging data chunks directly between GPUs.
CN202010318058.6A 2020-04-21 2020-04-21 Large-scale graph deep learning computation framework based on hierarchical optimization paradigm Pending CN111652346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010318058.6A 2020-04-21 2020-04-21 Large-scale graph deep learning computation framework based on hierarchical optimization paradigm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010318058.6A 2020-04-21 2020-04-21 Large-scale graph deep learning computation framework based on hierarchical optimization paradigm

Publications (1)

Publication Number Publication Date
CN111652346A 2020-09-11

Family

ID=72352514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010318058.6A Large-scale graph deep learning computation framework based on hierarchical optimization paradigm 2020-04-21 2020-04-21

Country Status (1)

Country Link
CN (1) CN111652346A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108639A1 (en) * 2017-10-09 2019-04-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Semantic Segmentation of 3D Point Clouds
CN110941494A (en) * 2019-12-02 2020-03-31 哈尔滨工程大学 Deep learning-oriented GPU parallel computing data processing method

Non-Patent Citations (2)

Title
LINGXIAO MA et al.: "Towards Efficient Large-Scale Graph Neural Network Computing" *
QI JINSHAN; LIANG XUN; LI ZHIYU; CHEN YANFANG; XU YUAN: "Large-scale complex information network representation learning: concepts, methods and challenges" (大规模复杂信息网络表示学习：概念、方法与挑战) *

Cited By (1)

Publication number Priority date Publication date Assignee Title
WO2022057310A1 (en) * 2020-09-15 2022-03-24 华为技术有限公司 Method, apparatus and system for training graph neural network

Similar Documents

Publication Publication Date Title
Zhang et al. BoostGCN: A framework for optimizing GCN inference on FPGA
Kiningham et al. GRIP: A graph neural network accelerator architecture
Dafir et al. A survey on parallel clustering algorithms for big data
Chen et al. Rubik: A hierarchical architecture for efficient graph neural network training
Chen et al. On-the-fly parallel data shuffling for graph processing on OpenCL-based FPGAs
Sun et al. Lightweight image classifier using dilated and depthwise separable convolutions
Pascucci et al. The ViSUS Visualization Framework.
KR20220034236A (en) Accelerated embedding layer computations
Zhang et al. A survey on graph neural network acceleration: Algorithms, systems, and customized hardware
Jiang et al. Improving the Performance of Whale Optimization Algorithm through OpenCL‐Based FPGA Accelerator
Shi et al. Versagnn: a versatile accelerator for graph neural networks
Chen et al. Rubik: A hierarchical architecture for efficient graph learning
Zhang et al. Efficient neighbor-sampling-based gnn training on cpu-fpga heterogeneous platform
Xia et al. Redundancy-free high-performance dynamic GNN training with hierarchical pipeline parallelism
CN116401502A (en) Method and device for optimizing Winograd convolution based on NUMA system characteristics
Garg et al. Understanding the design space of sparse/dense multiphase dataflows for mapping graph neural networks on spatial accelerators
CN113452655A (en) Distributed training method, gradient communication device and computing equipment
Zhou et al. FASTCF: FPGA-based accelerator for stochastic-gradient-descent-based collaborative filtering
Jiao et al. PGLBox: Multi-GPU Graph Learning Framework for Web-Scale Recommendation
CN111652346A (en) Large-scale graph deep learning computation framework based on hierarchical optimization paradigm
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
Chen et al. Point Cloud Acceleration by Exploiting Geometric Similarity
US11921784B2 (en) Flexible, scalable graph-processing accelerator
Lee et al. MVP: An Efficient CNN Accelerator with Matrix, Vector, and Processing-Near-Memory Units
Li et al. Spnet: Multi-shell kernel convolution for point cloud semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200911